AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

An Alternative Surrogate Loss for PGD-based Adversarial Testing

Gowal, Sven, Uesato, Jonathan, Qin, Chongli, Huang, Po-Sen, Mann, Timothy, Kohli, Pushmeet

arXiv.org Machine LearningOct-21-2019

Adversarial testing methods based on Projected Gradient Descent (PGD) are widely used for searching norm-bounded perturbations that cause the inputs of neural networks to be misclassified. This paper takes a deeper look at these methods and explains the effect of different hyperparameters (i.e., optimizer, step size and surrogate loss). We introduce the concept of MultiTargeted testing, which makes clever use of alternative surrogate losses, and explain when and how MultiTargeted is guaranteed to find optimal perturbations. Finally, we demonstrate that MultiTargeted outperforms more sophisticated methods and often requires less iterative steps than other variants of PGD found in the literature. Notably, MultiTargeted ranks first on MadryLab's white-box MNIST and CIFAR-10 leaderboards, reducing the accuracy of their MNIST model to 88.36% (with $\ell_\infty$ perturbations of $\epsilon = 0.3$) and the accuracy of their CIFAR-10 model to 44.03% (at $\epsilon = 8/255$). MultiTargeted also ranks first on the TRADES leaderboard reducing the accuracy of their CIFAR-10 model to 53.07% (with $\ell_\infty$ perturbations of $\epsilon = 0.031$).

adversarial input, arxiv preprint arxiv, multitargeted, (9 more...)

arXiv.org Machine Learning

1910.09338

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

On the Convergence of Perturbed Distributed Asynchronous Stochastic Gradient Descent to Second Order Stationary Points in Non-convex Optimization

Wang, Lifu, Shen, Bo, Zhao, Ning

arXiv.org Machine LearningOct-21-2019

In this paper, the second order convergence of non-convex optimization in the asynchronous stochastic gradient descent (ASGD) algorithm is studied systematically. We investigate the behavior of ASGD near and away from saddle points. Different from the general stochastic gradient descent(SGD), we show that ASGD might return back even if it has escaped the saddle points, yet after staying near a strict saddle point for a long enough time ($O(T)$), ASGD will finally go away from strict saddle points. An inequality is given to describe the process of ASGD to escape saddle points. Using a novel Razumikhin-Lyapunov method, we show the exponential instability of the perturbed gradient dynamics near the strict saddle points and give a more detailed estimation about how the time delay parameter $T$ influences the speed to escape. In particular, we consider the optimization of smooth nonconvex functions, and propose a perturbed asynchronous stochastic gradient descent algorithm with guarantee of convergence to second order stationary points with high probability in $O(1/\epsilon^4)$ iterations. To the best of our knowledge, this is the first work on the second order convergence of asynchronous algorithm.

algorithm, saddle point, theorem, (13 more...)

arXiv.org Machine Learning

1910.06

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Rocky Mountains (0.04)
(4 more...)

Genre: Research Report (0.81)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Fast Exact Matrix Completion: A Unifying Optimization Framework

Bertsimas, Dimitris, Li, Michael Lingzhi

arXiv.org Machine LearningOct-20-2019

We consider the problem of matrix completion of rank $k$ on an $n\times m$ matrix. We show that both the general case and the case with side information can be formulated as a combinatorical problem of selecting $k$ vectors from $p$ column features. We demonstrate that it is equivalent to a separable optimization problem that is amenable to stochastic gradient descent. We design fastImpute, based on projected stochastic gradient descent, to enable efficient scaling of the algorithm of sizes of $10^5 \times 10^5$. We report experiments on both synthetic and real-world datasets that show fastImpute is competitive in both the accuracy of the matrix recovered and the time needed across all cases. Furthermore, when a high number of entries are missing, fastImpute is over $75\%$ lower in MAPE and $10$x faster than current state-of-the-art matrix completion methods in both the case with side information and without.

algorithm, completion, matrix, (11 more...)

arXiv.org Machine Learning

1910.09092

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)

Genre: Research Report (0.50)

Industry:

Health & Medicine (0.93)
Leisure & Entertainment (0.68)
Media > Film (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

First-Order Preconditioning via Hypergradient Descent

Moskovitz, Ted, Wang, Rui, Lan, Janice, Kapoor, Sanyam, Miconi, Thomas, Yosinski, Jason, Rawal, Aditya

arXiv.org Machine LearningOct-18-2019

A BSTRACT Standard gradient descent methods are susceptible to a range of issues that can impede training, such as high correlations and different scaling in parameter space. These difficulties can be addressed by second-order approaches that apply a preconditioning matrix to the gradient to improve convergence. Unfortunately, such algorithms typically struggle to scale to high-dimensional problems, in part because the calculation of specific preconditioners such as the inverse Hessian or Fisher information matrix is highly expensive. We introduce first-order preconditioning (FOP), a fast, scalable approach that generalizes previous work on hyper-gradient descent (Almeida et al., 1998; Maclaurin et al., 2015; Baydin et al., 2017) to learn a preconditioning matrix that only makes use of first-order information. Experiments show that FOP is able to improve the performance of standard deep learning optimizers on several visual classification tasks with minimal computational overhead. We also investigate the properties of the learned preconditioning matrices and perform a preliminary theoretical analysis of the algorithm. Despite this, deep neural networks and other large-scale machine learning models applied to such problems typically rely on simple variations of gradient descent to train, which is known to be highly sensitive to these difficulties.

fop, gradient descent, matrix, (14 more...)

arXiv.org Machine Learning

1910.08461

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Russia (0.04)
Asia > Russia (0.04)
(2 more...)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)

Add feedback

On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation

Kumar, Harshat, Koppel, Alec, Ribeiro, Alejandro

arXiv.org Machine LearningOct-18-2019

Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold for in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle, which suggests that learning more slowly may lead to improved limit points, providing insight into the interplay between optimization and generalization in reinforcement learning.

approximation, assumption, convergence, (12 more...)

arXiv.org Machine Learning

1910.08412

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report (0.63)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.61)

Add feedback

Linear Regression with Gradient Descent from Scratch in Numpy

#artificialintelligenceOct-17-2019, 16:18:32 GMT

I strongly advise you to read the article linked above. It will set the foundations on the topic, plus some math is already discussed there. To start out, I'll define my dataset -- only three points that are in a linear relationship. I've chosen so few points only because the math will be shorter -- needless to say, the math won't be more complex for longer dataset, it would just be longer, and I don't want to make some stupid arithmetic mistake. Then I'll set coefficients beta 0 and beta 1 to some constant and define the cost function as Sum of Squared Residuals (SSR/SSE).

cost function, gradient descent, linear regression, (7 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.43)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.40)

Add feedback

Demystifying Different Variants of Gradient Descent Optimization Algorithm

#artificialintelligenceOct-17-2019, 12:58:50 GMT

In this post, we have looked at the batch gradient descent, the need to develop new optimization techniques, and then we briefly discussed how to interpret contour plots. After that, we have looked at behind six different optimization techniques and three different data strategies (batch, mini-batch & stochastic) with an intuitive understanding which helps to know where to use any of these algorithms. In Practice Adam optimizer with mini-batch of sizes 32, 64 and 128 is the default choice, at least for all the image classification tasks which deal with CNN and large sequence to sequence models.

descent, gradient descent, history, (13 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.92)

Add feedback

Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network

Tian, Yuandong

arXiv.org Machine LearningOct-17-2019

A BSTRACT To analyze deep ReLU network, we adopt a student-teacher setting in which an over-parameterized student network learns from the output of a fixed teacher network of the same depth, with Stochastic Gradient Descent (SGD). First, we prove that when the gradient is zero (or bounded above by a small constant) at every data point in training, a situation called interpolation setting, there exists many-to-one alignment between student and teacher nodes in the lowest layer under mild conditions. This suggests that generalization in unseen dataset is achievable, even the same condition often leads to zero training error. Second, analysis of noisy recovery and training dynamics in 2-layer network shows that strong teacher nodes (with large fan-out weights) are learned first and subtle teacher nodes are left unlearned until late stage of training. As a result, it could take a long time to converge into these small-gradient critical points. Our analysis shows that over-parameterization plays two roles: (1) it is a necessary condition for alignment to happen at the critical points, and (2) in training dynamics, it helps student nodes cover more teacher nodes with fewer iterations. Although networks with even one-hidden layer can fit any function (Hornik et al., 1989), it remains an open question how such networks can generalize to new data. Different from what traditional machine learning theory predicts, empirical evidence (Zhang et al., 2017) shows more parameters in neural network lead to better generalization. How over-parameterization yields strong generalization is an important question for understanding how deep learning works. In this paper, we analyze multi-layer ReLU networks by adopting teacher-student setting. The fixed teacher network provides the output for the student to learn via SGD. The student is over-parameterized (or over-realized): it has more nodes than the teacher. Therefore, there exists student weights whose gradient at every data point is zero. Here, we want to study the inverse problem: With small gradient at every training sample, can the student weights recover the teachers'? If so, then the generalization performance can be guaranteed if the training converges to such critical points. In this paper, we show that this so-called interpolation setting (Ma et al., 2017; Liu & Belkin, 2018; Bassily et al., 2018) leads to alignment: under certain conditions, each teacher node is provably aligned with at least one student node in the lowest layer. The condition is simply that the teacher node is observed by at least one student node, i.e., teacher's ReLU boundary lies in the activation region of that student. Therefore, more over-parameterization increases the probability of teachers being observed and thus being aligned. Furthermore, in 2-layer case, those student nodes that are not aligned with any teacher have zero contribution to the output and can be pruned.

artificial intelligence, machine learning, teacher node, (18 more...)

arXiv.org Machine Learning

1909.13458

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)

Genre: Research Report (0.50)

Industry:

Education (0.66)
Materials > Chemicals > Specialty Chemicals (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Mirror Descent View for Neural Network Quantization

Ajanthan, Thalaiyasingam, Gupta, Kartik, Torr, Philip H. S., Hartley, Richard, Dokania, Puneet K.

arXiv.org Machine LearningOct-17-2019

Quantizing large Neural Networks (NN) while maintaining the performance is highly desirable for resource-limited devices due to reduced memory and time complexity. NN quantization is usually formulated as a constrained optimization problem and optimized via a modified version of gradient descent. In this work, by interpreting the continuous parameters (unconstrained) as the dual of the quantized ones, we introduce a Mirror Descent (MD) framework (Bubeck (2015)) for NN quantization. Specifically, we provide conditions on the projections (i.e., mapping from continuous to quantized ones) which would enable us to derive valid mirror maps and in turn the respective MD updates. Furthermore, we discuss a numerically stable implementation of MD by storing an additional set of auxiliary dual variables (continuous). This update is strikingly analogous to the popular Straight Through Estimator (STE) based method which is typically viewed as a "trick" to avoid vanishing gradients issue but here we show that it is an implementation method for MD for certain projections. Our experiments on standard classification datasets (CIFAR-10/100, TinyImageNet) with convolutional and residual architectures show that our MD variants obtain fully-quantized networks with accuracies very close to the floating-point networks.

mirror map, projection, quantization, (15 more...)

arXiv.org Machine Learning

1910.08237

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Oceania > Australia (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report (0.50)
Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

Improving the convergence of SGD through adaptive batch sizes

Sievert, Scott, Charles, Zachary

arXiv.org Machine LearningOct-17-2019

Mini-batch stochastic gradient descent (SGD) approximates the gradient of an objective function with the average gradient of some batch of constant size. While small batch sizes can yield high-variance gradient estimates that prevent the model from learning a good model, large batches may require more data and computational effort. This work presents a method to change the batch size adaptively with model quality. We show that our method requires the same number of model updates as full-batch gradient descent while requiring the same total number of gradient computations as SGD. While this method requires evaluating the objective function, we present a passive approximation that eliminates this constraint and improves computational efficiency. We provide extensive experiments illustrating that our methods require far fewer model updates without increasing the total amount of computation.

batch size, computation, model update, (15 more...)

arXiv.org Machine Learning

1910.08222

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report (0.50)

Industry:

Education (0.46)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback