AITopics | Gradient Descent

Gradient Descent

Solving Non-Convex Non-Differentiable Min-Max Games using Proximal Gradient Method

arXiv.org Machine Learning, Mar-18-2020

Min-max saddle point games appear in a wide range of applications in machine learning and signal processing. Despite their wide applicability, theoretical studies are mostly limited to the special convex-concave structure. While some recent works generalized these results to special smooth non-convex cases, our understanding of non-smooth scenarios is still limited. In this work, we study a special form of non-smooth min-max games in which the objective function is (strongly) convex with respect to one of the players' decision variables. We show that a simple multi-step proximal gradient descent-ascent algorithm converges to an $\epsilon$-first-order Nash equilibrium of the min-max game, with the number of gradient evaluations being polynomial in $1/\epsilon$. We also show that our notion of stationarity is stronger than existing ones in the literature. Finally, we evaluate the performance of the proposed algorithm through an adversarial attack on a LASSO estimator.
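To make the idea of a multi-step proximal gradient descent-ascent scheme concrete, here is a minimal sketch on a toy scalar game. It is not the paper's algorithm or experiment: the objective f(x, y) = x*y - 0.5*y**2 + lam*|x|, the step sizes, and the function names are all assumptions chosen for illustration. The objective is convex in x, strongly concave in y, and non-smooth because of the |x| term, which is handled through its proximal operator (soft-thresholding).

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |.|: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def prox_gda(x0, y0, lam=0.5, eta_x=0.1, eta_y=0.5,
             inner_steps=10, outer_steps=200):
    """Toy multi-step proximal gradient descent-ascent on the assumed
    objective f(x, y) = x*y - 0.5*y**2 + lam*|x| (min over x, max over y)."""
    x, y = x0, y0
    for _ in range(outer_steps):
        # Inner loop: several gradient ascent steps on y, using df/dy = x - y.
        for _ in range(inner_steps):
            y += eta_y * (x - y)
        # Outer step: gradient step on the smooth part (df/dx = y),
        # followed by the prox of the non-smooth term lam * |x|.
        x = soft_threshold(x - eta_x * y, eta_x * lam)
    return x, y


x_final, y_final = prox_gda(x0=2.0, y0=0.0)
print(x_final, y_final)
```

For this toy game the inner maximization gives y = x, so the outer problem reduces to minimizing 0.5*x**2 + lam*|x|, whose solution is x = 0. The iterates should therefore approach (0, 0), with the soft-thresholding step setting x to exactly zero once it is small enough.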

algorithm, algorithm 1, concave, (12 more...)

arXiv.org Machine Learning

2003.08093

Country:

North America > United States > California (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry: Government (0.34)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

The Implicit Regularization of Stochastic Gradient Flow for Least Squares

Ali, Alnur, Dobriban, Edgar, Tibshirani, Ryan J.

arXiv.org Machine Learning, Mar-17-2020

We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression. We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow. We give a bound on the excess risk of stochastic gradient flow at time $t$, over ridge regression with tuning parameter $\lambda = 1/t$. The bound may be computed from explicit constants (e.g., the mini-batch size, step size, number of iterations), revealing precisely how these quantities drive the excess risk. Numerical examples show the bound can be small, indicating a tight relationship between the two estimators. We give a similar result relating the coefficients of stochastic gradient flow and ridge. These results hold under no conditions on the data matrix $X$, and across the entire optimization path (not just at convergence).
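The pairing of time $t$ with the ridge parameter $\lambda = 1/t$ can be illustrated in the simplest possible case. The sketch below is a deliberate simplification and an assumption on our part: it uses a single feature and deterministic (noise-free) gradient flow rather than the stochastic gradient flow studied in the paper, and it assumes the loss is scaled as $\frac{1}{2n}\|y - x b\|^2$ with ridge penalty $\frac{\lambda}{2} b^2$. Under those assumptions, each estimator equals the least-squares solution multiplied by a shrinkage factor, where $s = x^\top x / n$.

```python
import math


def ridge_factor(s, lam):
    """Shrinkage of the least-squares solution under ridge with penalty lam."""
    return s / (s + lam)


def gradient_flow_factor(s, t):
    """Shrinkage under noise-free gradient flow started at zero, run for time t."""
    return 1.0 - math.exp(-t * s)


def gradient_descent_factor(s, step, iters):
    """Shrinkage after `iters` full-batch gradient steps of size `step` from zero."""
    return 1.0 - (1.0 - step * s) ** iters


s = 1.0  # hypothetical value of x^T x / n for a single feature
for t in (0.1, 1.0, 10.0):
    flow = gradient_flow_factor(s, t)
    ridge = ridge_factor(s, 1.0 / t)
    gd = gradient_descent_factor(s, step=0.01, iters=round(t / 0.01))
    print(f"t={t:5.1f}  flow={flow:.4f}  ridge(1/t)={ridge:.4f}  gd={gd:.4f}")
```

Both factors move from 0 (heavy shrinkage) toward 1 (no shrinkage) as $t$ grows, and discrete gradient descent with a small step tracks the flow closely. The mini-batch noise that the paper's bound accounts for is not modeled here.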

exp, gradient flow, sgf, (9 more...)

arXiv.org Machine Learning

2003.07802

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Explaining Memorization and Generalization: A Large-Scale Study with Coherent Gradients

Zielinski, Piotr, Krishnan, Shankar, Chatterjee, Satrajit

arXiv.org Machine Learning, Mar-16-2020

Coherent Gradients is a recently proposed hypothesis to explain why over-parameterized neural networks trained with gradient descent generalize well even though they have sufficient capacity to memorize the training set. Inspired by random forests, Coherent Gradients proposes that (Stochastic) Gradient Descent (SGD) finds common patterns amongst examples (if such common patterns exist) since descent directions that are common to many examples add up in the overall gradient, and thus the biggest changes to the network parameters are those that simultaneously help many examples. The original Coherent Gradients paper validated the theory through causal intervention experiments on shallow, fully connected networks on MNIST. In this work, we perform similar intervention experiments on more complex architectures (such as VGG, Inception and ResNet) on more complex datasets (such as CIFAR-10 and ImageNet). Our results are in good agreement with the small scale study in the original paper, thus providing the first validation of coherent gradients in more practically relevant settings. We also confirm in these settings that suppressing incoherent updates by natural modifications to SGD can significantly reduce overfitting--lending credence to the hypothesis that memorization occurs when few examples are responsible for most of the gradient used in the update. Furthermore, we use the coherent gradients theory to explore a new characterization of why some examples are learned earlier than other examples, i.e., "easy" and "hard" examples.

artificial intelligence, gradient, machine learning, (18 more...)

arXiv.org Machine Learning

2003.07422

Country:

North America > United States > New York (0.14)
Oceania > Australia (0.14)
Europe > Sweden (0.14)

Genre: Research Report > New Finding (0.34)

Industry:

Materials > Chemicals > Industrial Gases > Liquified Gas (0.93)
Materials > Chemicals > Commodity Chemicals > Petrochemicals > LNG (0.93)
Energy > Oil & Gas > Midstream (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (0.63)

Simulated annealing based heuristic for multiple agile satellites scheduling under cloud coverage uncertainty

Han, Chao, Gu, Yi, Wu, Guohua, Wang, Xinwei

arXiv.org Artificial Intelligence, Mar-14-2020

Agile satellites are the new generation of Earth observation satellites (EOSs) with stronger attitude maneuvering capability. Since optical remote sensing instruments on satellites cannot see through clouds, cloud coverage has a significant influence on satellite observation missions. We are the first to address the multiple agile EOS scheduling problem under cloud coverage uncertainty, where the objective is to maximize the total observation profit. A chance-constrained programming model is first adopted to describe the uncertainty, and the observation profit under cloud coverage uncertainty is then calculated via a sample approximation method. Subsequently, an improved simulated annealing based heuristic combined with a fast insertion strategy is proposed for large-scale observation missions. The experimental results show that the improved simulated annealing heuristic outperforms other algorithms for the multiple AEOS scheduling problem under cloud coverage uncertainty, which verifies the efficiency and effectiveness of the proposed algorithm.

algorithm, satellite, scheduling, (12 more...)

arXiv.org Artificial Intelligence

2003.08363

Country:

Oceania > Australia (0.04)
North America > United States (0.04)
Asia > China > Hunan Province > Changsha (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Energy (0.48)
Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.81)

Balancedness and Alignment are Unlikely in Linear Neural Networks

Radhakrishnan, Adityanarayanan, Nichani, Eshaan, Bernstein, Daniel, Uhler, Caroline

arXiv.org Machine Learning, Mar-13-2020

We study the invariance properties of alignment in linear neural networks under gradient descent. Alignment of weight matrices is a form of implicit regularization, and previous works have studied this phenomenon in fully connected networks with 1-dimensional outputs. In such networks, we prove that there exists an initialization such that adjacent layers remain aligned throughout training under any real-valued loss function. We then define alignment for fully connected networks with multidimensional outputs and prove that it generally cannot be an invariant for such networks under the squared loss. Moreover, we characterize the datasets under which alignment is possible. We then analyze networks with layer constraints such as convolutional networks. In particular, we prove that gradient descent is equivalent to projected gradient descent, and show that alignment is impossible given sufficiently large datasets. Importantly, since our definition of alignment is a relaxation of balancedness, our negative results extend to this property.

alignment, invariant, matrix, (14 more...)

arXiv.org Machine Learning

2003.0634

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Massachusetts (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.78)

Can Implicit Bias Explain Generalization? Stochastic Convex Optimization as a Case Study

Dauber, Assaf, Feder, Meir, Koren, Tomer, Livni, Roi

arXiv.org Machine Learning, Mar-13-2020

The notion of implicit bias, or implicit regularization, has been suggested as a means to explain the surprising generalization ability of modern-day overparameterized learning algorithms. This notion refers to the tendency of the optimization algorithm towards a certain structured solution that often generalizes well. Recently, several papers have studied implicit regularization and were able to identify this phenomenon in various scenarios. We revisit this paradigm in arguably the simplest non-trivial setup, and study the implicit bias of Stochastic Gradient Descent (SGD) in the context of Stochastic Convex Optimization. As a first step, we provide a simple construction that rules out the existence of a distribution-independent implicit regularizer that governs the generalization ability of SGD. We then demonstrate a learning problem that rules out a very general class of distribution-dependent implicit regularizers from explaining generalization, which includes strongly convex regularizers as well as non-degenerate norm-based regularizations. Certain aspects of our constructions point to significant difficulties in providing a comprehensive explanation of an algorithm's generalization performance by solely arguing about its implicit regularization properties.

construction, regularization, regularizer, (15 more...)

arXiv.org Machine Learning

2003.06152

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Wasserstein-based Graph Alignment

Maretic, Hermina Petric, Gheche, Mireille El, Minder, Matthias, Chierchia, Giovanni, Frossard, Pascal

arXiv.org Machine Learning, Mar-12-2020

We propose a novel method for comparing non-aligned graphs of different sizes, based on the Wasserstein distance between graph signal distributions induced by the respective graph Laplacian matrices. Specifically, we cast a new formulation for the one-to-many graph alignment problem, which aims at matching a node in the smaller graph with one or more nodes in the larger graph. By integrating optimal transport in our graph comparison framework, we generate both a structurally-meaningful graph distance, and a signal transportation plan that models the structure of graph data. The resulting alignment problem is solved with stochastic gradient descent, where we use a novel Dykstra operator to ensure that the solution is a one-to-many (soft) assignment matrix. We demonstrate the performance of our novel framework on graph alignment and graph classification, and we show that our method leads to significant improvements with respect to the state-of-the-art algorithms for each of these tasks.

algorithm, graph, matrix, (15 more...)

arXiv.org Machine Learning

2003.06048

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
North America > Canada > Quebec > Montreal (0.04)
(3 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

Machine Learning on Volatile Instances

Zhang, Xiaoxi, Wang, Jianyu, Joshi, Gauri, Joe-Wong, Carlee

arXiv.org Machine Learning, Mar-12-2020

Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple worker nodes. However, running distributed SGD can be prohibitively expensive because it may require specialized computing resources such as GPUs for extended periods of time. We propose cost-effective strategies to exploit volatile cloud instances that are cheaper than standard instances, but may be interrupted by higher priority workloads. To the best of our knowledge, this work is the first to quantify how variations in the number of active worker nodes (as a result of preemption) affect SGD convergence and the time to train the model. By understanding these trade-offs between preemption probability of the instances, accuracy, and training time, we are able to derive practical strategies for configuring distributed SGD jobs on volatile instances such as Amazon EC2 spot instances and other preemptible cloud instances. Experimental results show that our strategies achieve good training performance at substantially lower cost.

active worker, iteration, runtime, (17 more...)

arXiv.org Machine Learning

2003.05649

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Oregon (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.34)

Industry:

Information Technology > Services (0.46)
Banking & Finance > Trading (0.33)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

Improving the Backpropagation Algorithm with Consequentialism Weight Updates over Mini-Batches

Paeedeh, Naeem, Ghiasi-Shirazi, Kamaledin

arXiv.org Machine Learning, Mar-11-2020

Least mean squares (LMS) is a particular case of the backpropagation (BP) algorithm applied to single-layer neural networks with the mean squared error (MSE) loss. One drawback of the LMS is that the instantaneous weight update is proportional to the square of the norm of the input vector. The normalized least mean squares (NLMS) algorithm amends this drawback by dividing the weight changes by the square of the norm of the input vector. The affine projection algorithm (APA) extended the NLMS algorithm to update the weights over a batch of recently seen samples. However, the application of NLMS and APA has been limited to single-layer networks and adaptive filters. In this paper, we consider a virtual target for each neuron of a multi-layer neural network and show that the BP algorithm is equivalent to training the weights of each layer using these virtual targets and the LMS algorithm. We also introduce a consequentialism interpretation of the NLMS and the APA algorithms that justifies their use in multi-layer neural networks. Given any optimization algorithm based on the BP over mini-batches, we propose a novel consequentialism method for updating the weights. Consequently, our proposed weight update can be applied both to plain stochastic gradient descent (SGD) and to momentum methods like RMSProp, Adam, and NAG. These ideas helped us to update the weights more carefully in such a way that minimization of the loss for one sample of the mini-batch does not interfere with other samples in that mini-batch. Our experiments show the usefulness of the proposed method in optimizing deep neural network architectures.
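The difference between the LMS and NLMS updates for a single linear neuron can be sketched as follows. This covers only the classical single-layer updates mentioned in the abstract, not the paper's proposed consequentialism method; the function names, the step size `mu`, the stabilizer `eps`, and the example input are assumptions for illustration. With LMS, the change in the prediction on the current sample scales with the squared norm of the input, so a large input can overshoot the target; NLMS divides that factor out.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def lms_update(w, x, target, mu):
    """One LMS step: w <- w + mu * err * x."""
    err = target - dot(w, x)
    return [wi + mu * err * xi for wi, xi in zip(w, x)]


def nlms_update(w, x, target, mu, eps=1e-12):
    """One NLMS step: w <- w + mu * err * x / (||x||^2 + eps)."""
    err = target - dot(w, x)
    norm_sq = dot(x, x)
    return [wi + mu * err * xi / (norm_sq + eps) for wi, xi in zip(w, x)]


w0 = [0.0, 0.0]
x = [3.0, 4.0]  # hypothetical input with squared norm 25
target = 1.0
mu = 0.5

err_before = target - dot(w0, x)
err_lms = target - dot(lms_update(w0, x, target, mu), x)
err_nlms = target - dot(nlms_update(w0, x, target, mu), x)

# After the step, the error on this sample is err * (1 - mu * ||x||^2) for LMS
# and approximately err * (1 - mu) for NLMS.
print(err_before, err_lms, err_nlms)
```

In this example the LMS step overshoots badly because mu * ||x||^2 = 12.5 is far above 1, while the NLMS step halves the error regardless of the input scale.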

algorithm, deep learning, neural network, (21 more...)

arXiv.org Machine Learning

2003.05164

Country: Asia > Middle East > Iran (0.14)

Genre: Research Report (0.50)

Industry: Energy > Oil & Gas (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Backpropagation (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Memory-efficient Learning for Large-scale Computational Imaging

Kellman, Michael, Zhang, Kevin, Tamir, Jon, Bostan, Emrah, Lustig, Michael, Waller, Laura

arXiv.org Machine Learning, Mar-11-2020

Critical aspects of computational imaging systems, such as experimental design and image priors, can be optimized through deep networks formed by the unrolled iterations of classical model-based reconstructions (termed physics-based networks). However, for real-world large-scale inverse problems, computing gradients via backpropagation is infeasible due to memory limitations of graphics processing units. In this work, we propose a memory-efficient learning procedure that exploits the reversibility of the network's layers to enable data-driven design for large-scale computational imaging systems. We demonstrate our method on a small-scale compressed sensing example, as well as two large-scale real-world systems: multi-channel magnetic resonance imaging and super-resolution optical microscopy.

iteration, memory-efficient learning, reconstruction, (15 more...)

arXiv.org Machine Learning

2003.05551

Country: North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)

Genre: Research Report (0.84)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.48)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.91)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)
