AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Failures and Successes of Cross-Validation for Early-Stopped Gradient Descent

Patil, Pratik, Wu, Yuchen, Tibshirani, Ryan J.

arXiv.org Machine LearningFeb-26-2024

We analyze the statistical properties of generalized cross-validation (GCV) and leave-one-out cross-validation (LOOCV) applied to early-stopped gradient descent (GD) in high-dimensional least squares regression. We prove that GCV is generically inconsistent as an estimator of the prediction risk of early-stopped GD, even for a well-specified linear model with isotropic features. In contrast, we show that LOOCV converges uniformly along the GD trajectory to the prediction risk. Our theory requires only mild assumptions on the data distribution and does not require the underlying regression function to be linear. Furthermore, by leveraging the individual LOOCV errors, we construct consistent estimators for the entire prediction error distribution along the GD trajectory and consistent estimators for a wide class of error functionals. This in particular enables the construction of pathwise prediction intervals based on GD iterates that have asymptotically correct nominal coverage conditional on the training data.

assumption, equation, exp, (15 more...)

arXiv.org Machine Learning

2402.16793

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
North America > United States > Pennsylvania (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (0.81)

Add feedback

Stable Training of Normalizing Flows for High-dimensional Variational Inference

Andrade, Daniel

arXiv.org Machine LearningFeb-26-2024

Variational inference with normalizing flows (NFs) is an increasingly popular alternative to MCMC methods. In particular, NFs based on coupling layers (Real NVPs) are frequently used due to their good empirical performance. In theory, increasing the depth of normalizing flows should lead to more accurate posterior approximations. However, in practice, training deep normalizing flows for approximating high-dimensional posterior distributions is often infeasible due to the high variance of the stochastic gradients. In this work, we show that previous methods for stabilizing the variance of stochastic gradient descent can be insufficient to achieve stable training of Real NVPs. As the source of the problem, we identify that, during training, samples often exhibit unusual high values. As a remedy, we propose a combination of two methods: (1) soft-thresholding of the scale in Real NVPs, and (2) a bijective soft log transformation of the samples. We evaluate these and other previously proposed modification on several challenging target distributions, including a high-dimensional horseshoe logistic regression model. Our experiments show that with our modifications, stable training of Real NVPs for posteriors with several thousand dimensions is possible, allowing for more accurate marginal likelihood estimation via importance sampling. Moreover, we evaluate several common training techniques and architecture choices and provide practical advise for training NFs for high-dimensional variational inference.

gaussian, loft, student, (13 more...)

arXiv.org Machine Learning

2402.16408

Country:

North America > United States (0.14)
Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)

Add feedback

Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Wu, Jingfeng, Bartlett, Peter L., Telgarsky, Matus, Yu, Bin

arXiv.org Machine LearningFeb-24-2024

We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta:= \Theta( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.

eos phase, probability, stepsize, (14 more...)

arXiv.org Machine Learning

2402.15926

Country: North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Iteration and Stochastic First-order Oracle Complexities of Stochastic Gradient Descent using Constant and Decaying Learning Rates

Imaizumi, Kento, Iiduka, Hideaki

arXiv.org Machine LearningFeb-23-2024

The performance of stochastic gradient descent (SGD), which is the simplest first-order optimizer for training deep neural networks, depends on not only the learning rate but also the batch size. They both affect the number of iterations and the stochastic first-order oracle (SFO) complexity needed for training. In particular, the previous numerical results indicated that, for SGD using a constant learning rate, the number of iterations needed for training decreases when the batch size increases, and the SFO complexity needed for training is minimized at a critical batch size and that it increases once the batch size exceeds that size. Here, we study the relationship between batch size and the iteration and SFO complexities needed for nonconvex optimization in deep learning with SGD using constant or decaying learning rates and show that SGD using the critical batch size minimizes the SFO complexity. We also provide numerical comparisons of SGD with the existing first-order optimizers and show the usefulness of SGD using a critical batch size. Moreover, we show that measured critical batch sizes are close to the sizes estimated from our theoretical results.

batch size, decay, sfo complexity, (14 more...)

arXiv.org Machine Learning

2402.15344

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > Japan (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Accelerating Convergence of Stein Variational Gradient Descent via Deep Unfolding

Kawamura, Yuya, Takabe, Satoshi

arXiv.org Machine LearningFeb-23-2024

Stein variational gradient descent (SVGD) is a prominent particle-based variational inference method used for sampling a target distribution. SVGD has attracted interest for application in machine-learning techniques such as Bayesian inference. In this paper, we propose novel trainable algorithms that incorporate a deep-learning technique called deep unfolding,into SVGD. This approach facilitates the learning of the internal parameters of SVGD, thereby accelerating its convergence speed. To evaluate the proposed trainable SVGD algorithms, we conducted numerical simulations of three tasks: sampling a one-dimensional Gaussian mixture, performing Bayesian logistic regression, and learning Bayesian neural networks. The results show that our proposed algorithms exhibit faster convergence than the conventional variants of SVGD.

c-dusvgd, step size, svgd, (14 more...)

arXiv.org Machine Learning

2402.15125

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report > New Finding (0.55)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Mudjacking: Patching Backdoor Vulnerabilities in Foundation Models

Liu, Hongbin, Reiter, Michael K., Gong, Neil Zhenqiang

arXiv.org Artificial IntelligenceFeb-22-2024

Foundation model has become the backbone of the AI ecosystem. In particular, a foundation model can be used as a general-purpose feature extractor to build various downstream classifiers. However, foundation models are vulnerable to backdoor attacks and a backdoored foundation model is a single-point-of-failure of the AI ecosystem, e.g., multiple downstream classifiers inherit the backdoor vulnerabilities simultaneously. In this work, we propose Mudjacking, the first method to patch foundation models to remove backdoors. Specifically, given a misclassified trigger-embedded input detected after a backdoored foundation model is deployed, Mudjacking adjusts the parameters of the foundation model to remove the backdoor. We formulate patching a foundation model as an optimization problem and propose a gradient descent based method to solve it. We evaluate Mudjacking on both vision and language foundation models, eleven benchmark datasets, five existing backdoor attacks, and thirteen adaptive backdoor attacks. Our results show that Mudjacking can remove backdoor from a foundation model while maintaining its utility.

backdoor attack, downstream classifier, foundation model, (14 more...)

arXiv.org Artificial Intelligence

2402.14977

Country: Asia > Nepal (0.04)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology > Security & Privacy (0.78)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

How Transformers Learn Causal Structure with Gradient Descent

Nichani, Eshaan, Damian, Alex, Lee, Jason D.

arXiv.org Machine LearningFeb-22-2024

The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structure via gradient-based training algorithms remains poorly understood. To better understand this process, we introduce an in-context learning task that requires learning latent causal structure. We prove that gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens. As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, we prove that transformers learn an induction head (Olsson et al., 2022). We confirm our theoretical findings by showing that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.

matrix, residual stream, transformer, (12 more...)

arXiv.org Machine Learning

2402.14735

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Convergence Acceleration of Markov Chain Monte Carlo-based Gradient Descent by Deep Unfolding

Hagiwara, Ryo, Takabe, Satoshi

arXiv.org Machine LearningFeb-21-2024

The proposed solver is based on the Ohzeki method that combines Markov-chain Monte-Carlo (MCMC) and gradient descent, and its step sizes are trained by minimizing a loss function. In the training process, we propose a sampling-based gradient estimation that substitutes auto-differentiation with a variance estimation, thereby circumventing the failure of back propagation due to the non-differentiability of MCMC. The numerical results for a few COPs demonstrated that the proposed solver significantly accelerated the convergence speed compared with the original Ohzeki method. Combinatorial optimization problems (COPs) comprising discrete variables are considered hard to solve exactly in polynomial time, which relates to the well-known P vs. NP problem. Along with deterministic approximation algorithms, samplers such as Markovchain Monte-Carlo (MCMC) have been applied to COPs. However, the convergence time for obtaining reasonable approximate solutions is long.

duom, ohzeki method, step size, (13 more...)

arXiv.org Machine Learning

2402.13608

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.61)

Add feedback

Private Gradient Descent for Linear Regression: Tighter Error Bounds and Instance-Specific Uncertainty Estimation

Brown, Gavin, Dvijotham, Krishnamurthy, Evans, Georgina, Liu, Daogao, Smith, Adam, Thakurta, Abhradeep

arXiv.org Artificial IntelligenceFeb-20-2024

We provide an improved analysis of standard differentially private gradient descent for linear regression under the squared error loss. Under modest assumptions on the input, we characterize the distribution of the iterate at each time step. Our analysis leads to new results on the algorithm's accuracy: for a proper fixed choice of hyperparameters, the sample complexity depends only linearly on the dimension of the data. This matches the dimension-dependence of the (non-private) ordinary least squares estimator as well as that of recent private algorithms that rely on sophisticated adaptive gradient-clipping schemes (Varshney et al., 2022; Liu et al., 2023). Our analysis of the iterates' distribution also allows us to construct confidence intervals for the empirical optimizer which adapt automatically to the variance of the algorithm on a particular data set. We validate our theorems through experiments on synthetic data.

algorithm, private gradient descent, probability, (10 more...)

arXiv.org Artificial Intelligence

2402.13531

Genre: Research Report > New Finding (0.87)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.73)

Add feedback

On the Stability of Gradient Descent for Large Learning Rate

Crăciun, Alexandru, Ghoshdastidar, Debarghya

arXiv.org Artificial IntelligenceFeb-20-2024

There currently is a significant interest in understanding the Edge of Stability (EoS) phenomenon, which has been observed in neural networks training, characterized by a non-monotonic decrease of the loss function over epochs, while the sharpness of the loss (spectral norm of the Hessian) progressively approaches and stabilizes around 2/(learning rate). Reasons for the existence of EoS when training using gradient descent have recently been proposed -- a lack of flat minima near the gradient descent trajectory together with the presence of compact forward-invariant sets. In this paper, we show that linear neural networks optimized under a quadratic loss function satisfy the first assumption and also a necessary condition for the second assumption. More precisely, we prove that the gradient descent map is non-singular, the set of global minimizers of the loss function forms a smooth manifold, and the stable minima form a bounded subset in parameter space. Additionally, we prove that if the step-size is too big, then the set of initializations from which gradient descent converges to a critical point has measure zero.

eigenvalue, gradient descent, measure zero, (16 more...)

arXiv.org Artificial Intelligence

2402.13108

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback