AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

A near-optimal stochastic gradient method for decentralized non-convex finite-sum optimization

arXiv.org Machine LearningSep-15-2020

This paper describes a $near$-$optimal$ stochastic first-order gradient method for decentralized finite-sum minimization of smooth non-convex functions. Specifically, we propose GT-SARAH that employs a local SARAH-type variance reduction and global gradient tracking to address the stochastic and decentralized nature of the problem. Considering a total number of $N$ cost functions, equally divided over a directed network of $n$ nodes, we show that GT-SARAH finds an $\epsilon$-accurate first-order stationary point in ${\mathcal{O}(N^{1/2}\epsilon^{-1})}$ gradient computations across all nodes, independent of the network topology, when ${n\leq\mathcal{O}(N^{1/2}(1-\lambda)^{3})}$, where ${(1-\lambda)}$ is the spectral gap of the network weight matrix. In this regime, GT-SARAH is thus, to the best our knowledge, the first decentralized method that achieves the algorithmic lower bound for this class of problems. Moreover, GT-SARAH achieves a $non$-$asymptotic$ $linear$ $speedup$, in that, the total number of gradient computations at each node is reduced by a factor of $1/n$ compared to the near-optimal algorithms for this problem class that process all data at a single node. We also establish the convergence rate of GT-SARAH in other regimes, in terms of the relative sizes of the number of nodes $n$, total number of functions $N$, and the network spectral gap $(1-\lambda)$. Over infinite time horizon, we establish the almost sure and mean-squared convergence of GT-SARAH to a first-order stationary point.

artificial intelligence, gt-sarah, machine learning, (18 more...)

arXiv.org Machine Learning

2008.07428

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Massachusetts > Middlesex County > Medford (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.52)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.51)

Add feedback

Learning Functors using Gradient Descent

Gavranović, Bruno

arXiv.org Artificial IntelligenceSep-14-2020

Neural networks are a general framework for differentiable optimization which includes many other machine learning approaches as special cases. In this paper we build a category-theoretic formalism around a neural network system called CycleGAN. CycleGAN is a general approach to unpaired image-to-image translation that has been getting attention in the recent years. Inspired by categorical database systems, we show that CycleGAN is a "schema", i.e. a specific category presented by generators and relations, whose specific parameter instantiations are just set-valued functors on this schema. We show that enforcing cycle-consistencies amounts to enforcing composition invariants in this category. We generalize the learning procedure to arbitrary such categories and show a special class of functors, rather than functions, can be learned using gradient descent. Using this framework we design a novel neural network system capable of learning to insert and delete objects from images without paired data. We qualitatively evaluate the system on the CelebA dataset and obtain promising results.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.4204/EPTCS.323.15

2009.06837

Country:

North America > United States > New York (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

A Vertical Federated Learning Method for Interpretable Scorecard and Its Application in Credit Scoring

Zheng, Fanglan, Erihe, null, Li, Kun, Tian, Jiang, Xiang, Xiaojia

arXiv.org Machine LearningSep-14-2020

With the success of big data and artificial intelligence in many fields, the applications of big data driven models are expected in financial risk management especially credit scoring and rating. Under the premise of data privacy protection, we propose a projected gradient-based method in the vertical federated learning framework for the traditional scorecard, which is based on logistic regression with bounded constraints, namely FL-LRBC. The latter enables multiple agencies to jointly train an optimized scorecard model in a single training session. It leads to the formation of the model with positive coefficients, while the time-consuming parameter-tuning process can be avoided. Moreover, the performance in terms of both AUC and the Kolmogorov-Smirnov (KS) statistics is significantly improved due to data enrichment using FL-LRBC. At present, FL-LRBC has already been applied to credit business in a China nation-wide financial holdings group.

artificial intelligence, machine learning, scorecard model, (13 more...)

arXiv.org Machine Learning

2009.06218

Country:

Asia > China (0.25)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre:

Research Report > New Finding (0.50)
Research Report > Experimental Study (0.35)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance > Credit (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)

Add feedback

A Qualitative Study of the Dynamic Behavior of Adaptive Gradient Algorithms

Ma, Chao, Wu, Lei, E, Weinan

arXiv.org Machine LearningSep-13-2020

The dynamic behavior of RMSprop and Adam algorithms is studied through a combination of careful numerical experiments and theoretical explanations. Three types of qualitative features are observed in the training loss curve: fast initial convergence, oscillations and large spikes. The sign gradient descent (signGD) algorithm, which is the limit of Adam when taking the learning rate to $0$ while keeping the momentum parameters fixed, is used to explain the fast initial convergence. For the late phase of Adam, three different types of qualitative patterns are observed depending on the choice of the hyper-parameters: oscillations, spikes and divergence. In particular, Adam converges faster and smoother when the values of the two momentum factors are close to each other.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

2009.06125

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Add feedback

What is Gradient Descent?

#artificialintelligenceSep-12-2020, 21:05:36 GMT

This tutorial is on the basics of gradient descent. It is also a continuation of the Intro to Machine Learning post, "What is Machine Learning?", which can be found here. Gradient descent is a method of finding the optimal weights for a model. We use the gradient descent algorithm to find the best machine learning model, with the lowest error and highest accuracy. A common explanation of gradient descent is the idea of standing on an uneven baseball field, blindfolded, and you want to find the lowest point of the field.

artificial intelligence, gradient descent, machine learning, (11 more...)

#artificialintelligence

Country:

Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.05)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.05)
North America > United States > California > San Francisco County > San Francisco (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

A general framework for decentralized optimization with first-order methods

Xin, Ran, Pu, Shi, Nedić, Angelia, Khan, Usman A.

arXiv.org Machine LearningSep-12-2020

Decentralized optimization to minimize a finite sum of functions over a network of nodes has been a significant focus within control and signal processing research due to its natural relevance to optimal control and signal estimation problems. More recently, the emergence of sophisticated computing and large-scale data science needs have led to a resurgence of activity in this area. In this article, we discuss decentralized first-order gradient methods, which have found tremendous success in control, signal processing, and machine learning problems, where such methods, due to their simplicity, serve as the first method of choice for many complex inference and training tasks. In particular, we provide a general framework of decentralized first-order methods that is applicable to undirected and directed communication networks alike, and show that much of the existing work on optimization and consensus can be related explicitly to this framework. We further extend the discussion to decentralized stochastic first-order methods that rely on stochastic gradients at each node and describe how local variance reduction schemes, previously shown to have promise in the centralized settings, are able to improve the performance of decentralized methods when combined with what is known as gradient tracking. We motivate and demonstrate the effectiveness of the corresponding methods in the context of machine learning and signal processing problems that arise in decentralized environments.

artificial intelligence, machine learning, optimization problem, (16 more...)

arXiv.org Machine Learning

2009.05837

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
North America > United States > Massachusetts > Middlesex County > Medford (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
(13 more...)

Genre: Research Report (0.82)

Industry: Education (0.66)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.37)

Add feedback

Momentum-based Gradient Methods in Multi-objective Recommender Systems

Mitrevski, Blagoj, Filipovic, Milena, Antognini, Diego, Glaude, Emma Lejal, Faltings, Boi, Musat, Claudiu

arXiv.org Artificial IntelligenceSep-10-2020

Multi-objective gradient methods are becoming the standard for solving multi-objective problems. Among others, they show promising results in developing multi-objective recommender systems with both correlated and uncorrelated objectives. Classic multi-gradient descent usually relies on the combination of the gradients, not including the computation of first and second moments of the gradients. This leads to a brittle behavior and misses important areas in the solution space. In this work, we create a multi-objective Adamize method that leverage the benefits of the Adam optimizer in single-objective problems. This corrects and stabilizes the gradients of every objective before calculating a common gradient descent vector that optimizes all the objectives simultaneously. We evaluate the benefits of Multi-objective Adamize on two multi-objective recommender systems and for three different objective combinations, both correlated or uncorrelated. We report significant improvements, measured with three different Pareto front metrics: hypervolume, coverage, and spacing. Finally, we show that the Adamized Pareto front strictly dominates the previous one on multiple objective pairs.

artificial intelligence, machine learning, objective, (15 more...)

arXiv.org Artificial Intelligence

2009.04695

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Switzerland > Vaud > Lausanne (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)

Add feedback

Partial local entropy and anisotropy in deep weight spaces

Musso, Daniele

arXiv.org Machine LearningSep-10-2020

Recent studies on the weight space of deep neural networks [1, 2] have highlighted the existence of rare subdominant clusters of configurations which yield a high test accuracy. Although these clusters constitute a deviation from typicality, they are efficiently encountered by stochastic gradient descent (SGD) algorithms and correspond to wide valleys of suitable loss functions, such as cross entropy [3]. An analogous circumstance occurs in the context of constraint satisfaction problems, where the chase after clusters of solutions is improved when the loss function gets supplemented by a term that encourages a local high density of solutions [4]. In order to find the number of solutions contained in a vicinity of a specific weight configuration, one can define a local solution-counting functional, namely, a local entropy. Classification tasks performed by means of quantized neural networks (where the weights are discrete) can be interpreted as constraint satisfaction problems. There are however two reasons to generalize the concept of local entropy: First, classification problems are typically required to reach a high but not necessarily perfect accuracy; second, they are often approached with machines that have continuous weights.

neural network, regularization, weight space, (15 more...)

arXiv.org Machine Learning

2007.09091

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Spain > Galicia > A Coruña Province > Santiago de Compostela (0.04)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.57)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Quantifying the Preferential Direction of the Model Gradient in Adversarial Training With Projected Gradient Descent

Lanfredi, Ricardo Bigolin, Schroeder, Joyce D., Tasdizen, Tolga

arXiv.org Machine LearningSep-10-2020

Adversarial training, especially projected gradient descent (PGD), has been the most successful approach for improving robustness against adversarial attacks. After adversarial training, gradients of models with respect to their inputs are meaningful and interpretable by humans. However, the concept of interpretability is not mathematically well established, making it difficult to evaluate it quantitatively. We define interpretability as the alignment of the model gradient with the vector pointing toward the closest point of the support of the other class. We propose a method for measuring this alignment for binary classification problems, using generative adversarial model training to produce the smallest residual needed to change the class present in the image. We show that PGD-trained models are more interpretable than the baseline according to our definition, and our metric presents higher alignment values than a competing metric formulation. We also show that enforcing this alignment increases the robustness of models without adversarial training.

artificial intelligence, machine learning, robustness, (19 more...)

arXiv.org Machine Learning

2009.04709

Country:

North America > United States > California > Los Angeles County > Long Beach (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
(11 more...)

Genre: Research Report (0.83)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.93)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Gradient Methods Never Overfit On Separable Data

Shamir, Ohad

arXiv.org Machine LearningSep-10-2020

A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data: If we run these methods for $T$ iterations on a dataset of size $m$, both the empirical risk and the generalization error decrease at an essentially optimal rate of $\tilde{\mathcal{O}}(1/\gamma^2 T)$ up till $T\approx m$, at which point the generalization error remains fixed at an essentially optimal level of $\tilde{\mathcal{O}}(1/\gamma^2 m)$ regardless of how large $T$ is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness.

artificial intelligence, gradient descent, machine learning, (18 more...)

arXiv.org Machine Learning

2007.00028

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.80)

Add feedback