AITopics | Gradient Descent

Gradient Descent

On Momentum-Based Gradient Methods for Bilevel Optimization with Nonconvex Lower-Level

arXiv.org Artificial Intelligence, Nov-18-2023

Bilevel optimization is a popular two-level hierarchical optimization, which has been widely applied to many machine learning tasks such as hyperparameter learning, meta learning and continual learning. Although many bilevel optimization methods have recently been developed, they are not well studied when the lower-level problem is nonconvex. To fill this gap, in this paper we study a class of nonconvex bilevel optimization problems, where both the upper-level and lower-level problems are nonconvex, and the lower-level problem satisfies the Polyak-Łojasiewicz (PL) condition. We propose an efficient momentum-based gradient bilevel method (MGBiO) to solve these deterministic problems. Meanwhile, we propose a class of efficient momentum-based stochastic gradient bilevel methods (MSGBiO and VR-MSGBiO) to solve these stochastic problems. Moreover, we provide a useful convergence analysis framework for our methods. Specifically, under some mild conditions, we prove that our MGBiO method has a sample (or gradient) complexity of $O(\epsilon^{-2})$ for finding an $\epsilon$-stationary solution of the deterministic bilevel problems (i.e., $\|\nabla F(x)\|\leq \epsilon$), which improves the existing best results by a factor of $O(\epsilon^{-1})$. Meanwhile, we prove that our MSGBiO and VR-MSGBiO methods have sample complexities of $\tilde{O}(\epsilon^{-4})$ and $\tilde{O}(\epsilon^{-3})$, respectively, in finding an $\epsilon$-stationary solution of the stochastic bilevel problems (i.e., $\mathbb{E}\|\nabla F(x)\|\leq \epsilon$), which improves the existing best results by a factor of $\tilde{O}(\epsilon^{-3})$. Extensive experimental results on bilevel PL game and hyper-representation learning demonstrate the efficiency of our algorithms. This paper commemorates the mathematician Boris Polyak (1935-2023).
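
For orientation, the problem class described in this abstract can be written out as follows. This is a sketch in commonly used notation; the symbols $f$, $g$, $y^*(x)$ and $\mu$ are assumptions introduced here for illustration and may differ from the paper's own notation.

$$
\min_{x} \; F(x) = f\bigl(x, y^*(x)\bigr) \quad \text{s.t.} \quad y^*(x) \in \arg\min_{y} g(x, y)
$$

The lower-level function $g(x, \cdot)$ is said to satisfy the PL condition if, for some $\mu > 0$ and all $x, y$,

$$
\|\nabla_y g(x, y)\|^2 \;\geq\; 2\mu \Bigl( g(x, y) - \min_{y'} g(x, y') \Bigr).
$$

The PL condition does not require $g(x, \cdot)$ to be convex, which is why it is compatible with the nonconvex lower-level setting the abstract describes.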

algorithm, inequality, optimization, (14 more...)

arXiv.org Artificial Intelligence

2303.03944

Country:

Asia > China > Jiangsu Province > Nanjing (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

A Whole New Ball Game: A Primal Accelerated Method for Matrix Games and Minimizing the Maximum of Smooth Functions

Carmon, Yair, Jambulapati, Arun, Jin, Yujia, Sidford, Aaron

arXiv.org Artificial Intelligence, Nov-17-2023

We design algorithms for minimizing $\max_{i\in[n]} f_i(x)$ over a $d$-dimensional Euclidean or simplex domain. When each $f_i$ is $1$-Lipschitz and $1$-smooth, our method computes an $\epsilon$-approximate solution using $\widetilde{O}(n \epsilon^{-1/3} + \epsilon^{-2})$ gradient and function evaluations, and $\widetilde{O}(n \epsilon^{-4/3})$ additional runtime. For large $n$, our evaluation complexity is optimal up to polylogarithmic factors. In the special case where each $f_i$ is linear -- which corresponds to finding a near-optimal primal strategy in a matrix game -- our method finds an $\epsilon$-approximate solution in runtime $\widetilde{O}(n (d/\epsilon)^{2/3} + nd + d\epsilon^{-2})$. For $n>d$ and $\epsilon=1/\sqrt{n}$ this improves over all existing first-order methods. When additionally $d = \omega(n^{8/11})$ our runtime also improves over all known interior point methods. Our algorithm combines three novel primitives: (1) A dynamic data structure which enables efficient stochastic gradient estimation in small $\ell_2$ or $\ell_1$ balls. (2) A mirror descent algorithm tailored to our data structure implementing an oracle which minimizes the objective over these balls. (3) A simple ball oracle acceleration framework suitable for non-Euclidean geometry.

algorithm, data structure, inequality, (17 more...)

arXiv.org Artificial Intelligence

2311.10886

Country:

North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre:

Research Report (0.50)
Workflow (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

INTAGS: Interactive Agent-Guided Simulation

Wei, Song, Coletta, Andrea, Vyetrenko, Svitlana, Balch, Tucker

arXiv.org Artificial Intelligence, Nov-17-2023

In many applications involving multi-agent systems (MAS), it is imperative to test an experimental (Exp) autonomous agent in a high-fidelity simulator prior to its deployment to production, to avoid unexpected losses in the real world. Such a simulator acts as the environmental background (BG) agent(s), called an agent-based simulator (ABS), aiming to replicate the complex real MAS. However, developing a realistic ABS remains challenging, mainly due to the sequential and dynamic nature of such systems. To fill this gap, we propose a metric to distinguish between real and synthetic multi-agent systems, which is evaluated through the live interaction between the Exp and BG agents to explicitly account for the systems' sequential nature. Specifically, we characterize the system/environment by studying the effect of a sequence of BG agents' responses to the environment state evolution and take the differences between such effects as the MAS distance metric. The effect estimation is cast as a causal inference problem since the environment evolution is confounded with the previous environment state. Importantly, we propose the Interactive Agent-Guided Simulation (INTAGS) framework to build a realistic ABS by optimizing over this novel metric. To adapt to any environment with interactive sequential decision-making agents, INTAGS formulates the simulator as a stochastic policy in reinforcement learning. Moreover, INTAGS utilizes the policy gradient update to bypass differentiating the proposed metric such that it can support non-differentiable operations of multi-agent environments. Through extensive experiments, we demonstrate the effectiveness of INTAGS on an equity stock market simulation example. We show that using INTAGS to calibrate the simulator can generate more realistic market data compared to the state-of-the-art conditional Wasserstein Generative Adversarial Network approach.

agent, bg agent, rollout, (16 more...)

arXiv.org Artificial Intelligence

2309.01784

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Pinpoint, Not Criticize: Refining Large Language Models via Fine-Grained Actionable Feedback

Xu, Wenda, Deutsch, Daniel, Finkelstein, Mara, Juraska, Juraj, Zhang, Biao, Liu, Zhongtao, Wang, William Yang, Li, Lei, Freitag, Markus

arXiv.org Artificial Intelligence, Nov-15-2023

Recent improvements in text generation have leveraged human feedback to improve the quality of the generated output. However, human feedback is not always available, especially during inference. In this work, we propose an inference-time optimization method, FITO, that uses fine-grained actionable feedback in the form of error type, error location and severity level, predicted by a learned error pinpoint model, for iterative refinement. FITO starts with an initial output, then iteratively incorporates the feedback via a refinement model that generates an improved output conditioned on the feedback. Given the uncertainty of consistent refined samples at iterative steps, we formulate iterative refinement as a local search problem and develop a simulated annealing based algorithm that balances exploration of the search space and optimization for output quality. We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA) and topical summarization. We observe 0.8 and 0.7 MetricX gain on Chinese-English and English-German translation, and 4.5 and 1.8 ROUGE-L gain on long-form QA and topical summarization respectively, with a single iteration of refinement. With our simulated annealing algorithm, we see further quality improvements, including up to 1.7 MetricX improvements over the baseline approach.
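
The local-search formulation mentioned in this abstract can be illustrated with a generic simulated annealing loop. The sketch below is not the paper's algorithm: `propose` stands in for a refinement model, `score` for a quality metric, and the toy integer problem, the cooling schedule, and all function names are assumptions made for illustration.

```python
import math
import random


def simulated_annealing(initial, propose, score, steps=200, t0=1.0, cooling=0.95, seed=0):
    """Maximize `score` by local search with a decaying acceptance temperature."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept a worse candidate with
        # probability exp(delta / T), which shrinks as T cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
        temperature = max(temperature * cooling, 1e-8)
    return best, best_score


# Hypothetical toy stand-ins: the "output" is an integer, a refinement
# nudges it by one, and quality peaks at 7.
def toy_propose(x, rng):
    return x + rng.choice([-1, 1])


def toy_score(x):
    return -abs(x - 7)


best, best_score = simulated_annealing(0, toy_propose, toy_score)
```

Early on, the high temperature lets the search accept refinements that lower the score (exploration); as the temperature decays, the loop behaves like greedy hill climbing (optimization for quality).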

algorithm, feedback model, translation, (14 more...)

arXiv.org Artificial Intelligence

2311.09336

Country:

North America > United States > California > Los Angeles County > Pasadena (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.04)
North America > Canada > Ontario > Toronto (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Accelerating Toeplitz Neural Network with Constant-time Inference Complexity

Qin, Zhen, Zhong, Yiran

arXiv.org Artificial Intelligence, Nov-15-2023

Toeplitz Neural Networks (TNNs) have exhibited outstanding performance in various sequence modeling tasks. They outperform commonly used Transformer-based models while benefiting from log-linear space-time complexities. On the other hand, State Space Models (SSMs) achieve lower performance than TNNs in language modeling but offer the advantage of constant inference complexity. In this paper, we aim to combine the strengths of TNNs and SSMs by converting TNNs to SSMs during inference, thereby enabling TNNs to achieve the same constant inference complexities as SSMs. To accomplish this, we formulate the conversion process as an optimization problem and provide a closed-form solution. We demonstrate how to transform the target equation into a Vandermonde linear system problem, which can be efficiently solved using the Discrete Fourier Transform (DFT). Notably, our method requires no training and maintains numerical stability. It can also be applied to any LongConv-based model. To assess its effectiveness, we conduct extensive experiments on language modeling tasks across various settings. Additionally, we compare our method to other gradient-descent solutions, highlighting the superior numerical stability of our approach. The source code is available at https://github.com/OpenNLPLab/ETSC-Exact-Toeplitz-to-SSM-Conversion.
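
To see why a Vandermonde system can be solved with a DFT, consider the special case where the Vandermonde nodes are the n-th roots of unity. That choice of nodes is an assumption made here for illustration, not something stated in the abstract. In that case the system matrix is exactly the DFT matrix, so solving the system reduces to an inverse DFT. The sketch below uses a naive O(n^2) transform for clarity (an FFT would be used in practice), and the function names and sample kernel are hypothetical.

```python
import cmath


def solve_unit_root_vandermonde(k):
    """Solve V b = k where V[r][c] = w**(r*c) and w = exp(-2*pi*i/n).

    V is the DFT matrix, so its inverse is conj(V) / n (an inverse DFT).
    """
    n = len(k)
    return [
        sum(k[r] * cmath.exp(2j * cmath.pi * r * c / n) for r in range(n)) / n
        for c in range(n)
    ]


def evaluate(b, r):
    """Recompute k[r] = sum_c b[c] * node_c**r with node_c = exp(-2*pi*i*c/n)."""
    n = len(b)
    return sum(b[c] * cmath.exp(-2j * cmath.pi * c / n) ** r for c in range(n))


kernel = [1.0, 0.5, 0.25, 0.125]            # hypothetical Toeplitz kernel values
coeffs = solve_unit_root_vandermonde(kernel)
reconstructed = [evaluate(coeffs, r) for r in range(len(kernel))]
```

Each kernel value is recovered as a weighted sum of powers of fixed nodes, which is the form a diagonal state space recurrence produces; no iterative fitting is involved.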

complexity, inference, representation, (15 more...)

arXiv.org Artificial Intelligence

2311.08756

Country:

Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment (1.00)

Technology:

Information Technology > Data Science > Data Quality > Data Transformation (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Flexible numerical optimization with ensmallen

Curtin, Ryan R., Edel, Marcus, Prabhu, Rahul Ganesh, Basak, Suryoday, Lou, Zhihao, Sanderson, Conrad

arXiv.org Artificial Intelligence, Nov-15-2023

This report provides an introduction to the ensmallen numerical optimization library, as well as a deep dive into the technical details of how it works. The library provides a fast and flexible C++ framework for mathematical optimization of arbitrary user-supplied functions. A large set of pre-built optimizers is provided, including many variants of Stochastic Gradient Descent and Quasi-Newton optimizers. Several types of objective functions are supported, including differentiable, separable, constrained, and categorical objective functions. Implementation of a new optimizer requires only one method, while a new objective function requires typically only one or two C++ methods. Through internal use of C++ template metaprogramming, ensmallen provides support for arbitrary user-supplied callbacks and automatic inference of unsupplied methods without any runtime overhead. Empirical comparisons show that ensmallen outperforms other optimization frameworks (such as Julia and SciPy), sometimes by large margins. The library is available at https://ensmallen.org and is distributed under the permissive BSD license.

callback, evaluatewithgradient, optimizer, (14 more...)

arXiv.org Artificial Intelligence

2003.04103

Country:

Oceania > Australia (0.04)
North America > United States > Texas (0.04)
North America > United States > Montana (0.04)
(7 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)

Gradients Look Alike: Sensitivity is Often Overestimated in DP-SGD

Thudi, Anvith, Jia, Hengrui, Meehan, Casey, Shumailov, Ilia, Papernot, Nicolas

arXiv.org Machine Learning, Nov-15-2023

Differentially private stochastic gradient descent (DP-SGD) is the canonical approach to private deep learning. While the current privacy analysis of DP-SGD is known to be tight in some settings, several empirical results suggest that models trained on common benchmark datasets leak significantly less privacy for many datapoints. Yet, despite past attempts, a rigorous explanation for why this is the case has not been reached. Is it because there exist tighter privacy upper bounds when restricted to these dataset settings, or are our attacks not strong enough for certain datapoints? In this paper, we provide the first per-instance (i.e., "data-dependent") DP analysis of DP-SGD. Our analysis captures the intuition that points with similar neighbors in the dataset enjoy better data-dependent privacy than outliers. Formally, this is done by modifying the per-step privacy analysis of DP-SGD to introduce a dependence on the distribution of model updates computed from a training dataset. We further develop a new composition theorem to effectively use this new per-step analysis to reason about an entire training run. Put all together, our evaluation shows that this novel DP-SGD analysis allows us to now formally show that DP-SGD leaks significantly less privacy for many datapoints (when trained on common benchmarks) than the current data-independent guarantee. This implies privacy attacks will necessarily fail against many datapoints if the adversary does not have sufficient control over the possible training datasets.
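
As background for the "per-step" analysis this abstract refers to, a single DP-SGD update in its standard textbook form clips each per-example gradient, sums the clipped gradients, adds Gaussian noise scaled to the clipping norm, and then takes an averaged step. The sketch below shows only that standard step, not the paper's new analysis; the function and parameter names are assumptions chosen for illustration.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each gradient, sum, add Gaussian noise, average, step."""
    batch_size = len(per_example_grads)
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm  # noise scale is tied to the clipping norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
new_params = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                         noise_multiplier=1.0, rng=rng)
```

Clipping bounds how much any single datapoint can move the update, which is the worst-case sensitivity a data-independent guarantee relies on. When many gradients in a batch look alike, the realized difference between neighboring datasets can be much smaller than that bound.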

artificial intelligence, machine learning, mini-batch size, (14 more...)

arXiv.org Machine Learning

2307.0031

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report > New Finding (0.87)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Non-Uniform Smoothness for Gradient Descent

Berahas, Albert S., Roberts, Lindon, Roosta, Fred

arXiv.org Artificial Intelligence, Nov-14-2023

The analysis of gradient descent-type methods typically relies on the Lipschitz continuity of the objective gradient. This generally requires an expensive hyperparameter tuning process to appropriately calibrate a stepsize for a given problem. In this work we introduce a local first-order smoothness oracle (LFSO) which generalizes the Lipschitz continuous gradients smoothness condition and is applicable to any twice-differentiable function. We show that this oracle can encode all relevant problem information for tuning stepsizes for a suitably modified gradient descent method and give global and local convergence results. We also show that LFSOs in this modified first-order method can yield global linear convergence rates for non-strongly convex problems with extremely flat minima, and thus improve over the lower bound on rates achievable by general (accelerated) first-order methods.

algorithm 1, lfso, stepsize, (14 more...)

arXiv.org Artificial Intelligence

2311.08615

Country:

Oceania > Australia > Queensland (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > New York (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.83)

Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback

Jordan, Michael I., Lin, Tianyi, Zhou, Zhengyuan

arXiv.org Artificial Intelligence, Nov-14-2023

Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions: (1) in the single-agent setting, it achieves an optimal regret of $\Theta(\log T)$ for strongly convex cost functions; and (2) in the multi-agent setting of strongly monotone games, with each agent employing OGD, we obtain last-iterate convergence of the joint action to a unique Nash equilibrium at an optimal rate of $\Theta(\frac{1}{T})$. While these finite-time guarantees highlight its merits, OGD has the drawback that it requires knowing the strong convexity/monotonicity parameters. In this paper, we design a fully adaptive OGD algorithm, AdaOGD, that does not require a priori knowledge of these parameters. In the single-agent setting, our algorithm achieves $O(\log^2(T))$ regret under strong convexity, which is optimal up to a log factor. Further, if each agent employs AdaOGD in strongly monotone games, the joint action converges in a last-iterate sense to a unique Nash equilibrium at a rate of $O(\frac{\log^3 T}{T})$, again optimal up to log factors. We illustrate our algorithms in a learning version of the classical newsvendor problem, where due to lost sales, only (noisy) gradient feedback can be observed. Our results immediately yield the first feasible and near-optimal algorithm for both the single-retailer and multi-retailer settings. We also extend our results to the more general setting of exp-concave cost functions and games, using the online Newton step (ONS) algorithm.
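
The drawback this abstract mentions is visible in the classical OGD update itself: for $\mu$-strongly convex losses, the standard step size $1/(\mu t)$ has to be computed from $\mu$. The sketch below shows that non-adaptive baseline on a hypothetical one-dimensional quadratic loss; it does not implement AdaOGD, and the loss family, interval bounds, and names are assumptions for illustration.

```python
def online_gradient_descent(targets, mu=1.0, lower=-10.0, upper=10.0, x0=0.0):
    """Projected OGD on losses f_t(x) = (mu / 2) * (x - z_t)**2 with step 1 / (mu * t)."""
    x = x0
    iterates = [x]
    for t, z in enumerate(targets, start=1):
        grad = mu * (x - z)            # gradient feedback observed in round t
        x = x - grad / (mu * t)        # the step size requires knowing mu
        x = min(max(x, lower), upper)  # projection onto the feasible interval
        iterates.append(x)
    return iterates


iterates = online_gradient_descent([1.0, 2.0, 3.0, 6.0])
```

For this particular loss family the update simplifies to a running average of the observed targets, so the last iterate equals their mean whenever the projection is inactive. If `mu` is misspecified in the step size, that identity is lost, which is the kind of parameter dependence an adaptive method aims to remove.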

algorithm, nash equilibrium, optimization, (14 more...)

arXiv.org Artificial Intelligence

2310.14085

Country:

Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York (0.04)
(2 more...)

Genre: Research Report > New Finding (0.54)

Industry: Retail (0.69)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Resetting the Optimizer in Deep RL: An Empirical Study

Asadi, Kavosh, Fakoor, Rasool, Sabach, Shoham

arXiv.org Artificial Intelligence, Nov-14-2023

We focus on the task of approximating the optimal value function in deep reinforcement learning. This iterative process consists of solving a sequence of optimization problems where the loss function changes per iteration. The common approach to solving this sequence of problems is to employ modern variants of the stochastic gradient descent algorithm such as Adam. These optimizers maintain their own internal parameters such as estimates of the first-order and the second-order moments of the gradient, and update them over time. Therefore, information obtained in previous iterations is used to solve the optimization problem in the current iteration. We demonstrate that this can contaminate the moment estimates because the optimization landscape can change arbitrarily from one iteration to the next. To hedge against this negative effect, a simple idea is to reset the internal parameters of the optimizer when starting a new iteration. We empirically investigate this resetting idea by employing various optimizers in conjunction with the Rainbow algorithm. We demonstrate that this simple modification significantly improves the performance of deep RL on the Atari benchmark.
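
The resetting idea described in this abstract can be sketched with a minimal Adam implementation whose moment estimates and step counter are cleared at the start of each new optimization problem. This is an illustrative sketch under stated assumptions, not the authors' code: the `Adam` class, the `train` helper, and the two toy quadratic problems are hypothetical, and the Adam update follows its standard published form.

```python
import math


class Adam:
    """Minimal Adam over a list of scalars, with an explicit state reset."""

    def __init__(self, n_params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.n_params = n_params
        self.reset()

    def reset(self):
        # Discard the first and second moment estimates and the step counter.
        self.m = [0.0] * self.n_params
        self.v = [0.0] * self.n_params
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        new_params = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)  # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            new_params.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return new_params


def train(optimizer, params, problems, steps_per_problem, reset_between=True):
    """Solve a sequence of problems, optionally resetting the optimizer at each new one."""
    for grad_fn in problems:
        if reset_between:
            optimizer.reset()
        for _ in range(steps_per_problem):
            params = optimizer.step(params, grad_fn(params))
    return params


# Two hypothetical consecutive problems whose minimizers differ (x = 1, then x = -1).
problems = [lambda p: [2 * (p[0] - 1.0)], lambda p: [2 * (p[0] + 1.0)]]
final = train(Adam(1, lr=0.1), [0.0], problems, steps_per_problem=50)
```

Without the reset, the moments accumulated while descending the first loss would still shape the early steps on the second loss, even though the two landscapes are unrelated.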

average episode score, rainbow, training frame, (15 more...)

arXiv.org Artificial Intelligence

2306.17833

Country:

North America > Canada > Alberta (0.14)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
