AITopics | Gradient Descent

Gradient Descent

Mirrorless Mirror Descent: A More Natural Discretization of Riemannian Gradient Flow

Gunasekar, Suriya, Woodworth, Blake, Srebro, Nathan

arXiv.org Machine Learning, Apr-2-2020

We present a direct (primal only) derivation of Mirror Descent as a "partial" discretization of gradient flow on a Riemannian manifold where the metric tensor is the Hessian of the Mirror Descent potential function. We argue that this discretization is more faithful to the geometry than Natural Gradient Descent, which is obtained by a "full" forward Euler discretization. This view helps shed light on the relationship between the methods and allows generalizing Mirror Descent to any Riemannian geometry, even when the metric tensor is not a Hessian, and thus there is no "dual".
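The contrast between the two discretizations can be sketched on a toy problem. The block below is only an illustration under stated assumptions: the entropy potential psi(w) = sum(w_i log w_i - w_i) on the positive orthant and the quadratic objective are choices made here, not taken from the paper. With this potential the Hessian is diag(1 / w_i), so Mirror Descent becomes a multiplicative update while Natural Gradient Descent is a forward Euler step on the same flow.

```python
import math

def grad_f(w, target):
    # Gradient of the illustrative objective f(w) = 0.5 * sum((w_i - target_i)^2).
    return [wi - ti for wi, ti in zip(w, target)]

def mirror_descent_step(w, g, eta):
    # grad psi(w) = log w, so log w_new = log w - eta * g (multiplicative update).
    return [wi * math.exp(-eta * gi) for wi, gi in zip(w, g)]

def natural_gradient_step(w, g, eta):
    # Forward Euler on dw/dt = -H(w)^{-1} g with H(w) = diag(1 / w_i).
    return [wi - eta * wi * gi for wi, gi in zip(w, g)]

w0 = [2.0, 0.5]
target = [0.1, 1.0]
g0 = grad_f(w0, target)

# Large step: the two updates differ qualitatively.
md_large = mirror_descent_step(w0, g0, eta=1.0)
ngd_large = natural_gradient_step(w0, g0, eta=1.0)

# Small step: the two updates agree to first order in eta.
md_small = mirror_descent_step(w0, g0, eta=1e-3)
ngd_small = natural_gradient_step(w0, g0, eta=1e-3)
max_gap = max(abs(a - b) for a, b in zip(md_small, ngd_small))
```

In this toy setting the Mirror Descent iterate stays in the positive orthant for any step size, whereas the Natural Gradient iterate can leave it when the step is large; for small steps the two coincide up to second-order terms.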

descent, discretization, mirror descent, (11 more...)

arXiv.org Machine Learning

2004.01025

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.38)

Stochastic gradient descent with random learning rate

Musso, Daniele

arXiv.org Machine Learning, Apr-1-2020

We propose to optimize neural networks with a uniformly-distributed random learning rate. The associated stochastic gradient descent algorithm can be approximated by continuous stochastic equations and analyzed with the Fokker-Planck formalism. In the small learning rate approximation, the training process is characterized by an effective temperature which depends on the average learning rate, the mini-batch size and the momentum of the optimization algorithm. By comparing the random learning rate protocol with cyclic and constant protocols, we suggest that the random choice is generically the best strategy in the small learning rate regime, yielding better regularization without extra computational cost. We provide supporting evidence through experiments on both shallow, fully-connected and deep, convolutional neural networks for image classification on the MNIST and CIFAR10 datasets.
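A minimal sketch of the random learning rate idea, assuming the rate is drawn uniformly from [0, 2 * mean_lr] at every step (the interval, the one-dimensional quadratic loss and the Gaussian stand-in for mini-batch noise are all assumptions of this example, not details from the paper):

```python
import random

def noisy_grad(w, rng):
    # Gradient of 0.5 * (w - 3)^2 plus Gaussian noise that stands in
    # for mini-batch sampling noise.
    return (w - 3.0) + rng.gauss(0.0, 0.1)

def sgd_random_lr(grad, w0, mean_lr, steps, seed=0):
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        # Assumption: learning rate uniform on [0, 2 * mean_lr],
        # so its average equals mean_lr.
        lr = rng.uniform(0.0, 2.0 * mean_lr)
        w -= lr * grad(w, rng)
    return w

w_final = sgd_random_lr(noisy_grad, w0=0.0, mean_lr=0.05, steps=2000)
```

The per-step cost is the same as for a constant rate; only one extra random draw is needed.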

effective temperature, learning rate, protocol, (15 more...)

arXiv.org Machine Learning

2003.06926

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Spain > Galicia > A Coruña Province > Santiago de Compostela (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Second-Order Guarantees in Centralized, Federated and Decentralized Nonconvex Optimization

Vlaski, Stefan, Sayed, Ali H.

arXiv.org Machine Learning, Mar-31-2020

Rapid advances in data collection and processing capabilities have allowed for the use of increasingly complex models that give rise to nonconvex optimization problems. These formulations, however, can be arbitrarily difficult to solve in general, in the sense that even simply verifying that a given point is a local minimum can be NP-hard [1]. Still, some relatively simple algorithms have been shown to lead to surprisingly good empirical results in many contexts of interest. Perhaps the most prominent example is the success of the backpropagation algorithm for training neural networks. Several recent works have pursued rigorous analytical justification for this phenomenon by studying the structure of the nonconvex optimization problems and establishing that simple algorithms, such as gradient descent and its variations, perform well in converging towards local minima and avoiding saddle points. A key insight in these analyses is that gradient perturbations play a critical role in allowing local descent algorithms to efficiently distinguish desirable from undesirable stationary points and escape from the latter. In this article, we cover recent results on second-order guarantees for stochastic first-order optimization algorithms in centralized, federated, and decentralized architectures. A key desirable feature of automated learning algorithms is the ability to learn models directly from data with minimal need for direct intervention by the designer. The authors are with the Institute of Electrical Engineering, École Polytechnique Fédérale de Lausanne.
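The role of gradient perturbations can be illustrated on a small example. The function f(x, y) = 0.5 x^2 - 0.5 y^2 + 0.25 y^4, the step size and the noise level below are assumptions chosen for this sketch; the function has a saddle point at (0, 0) and local minima at (0, 1) and (0, -1).

```python
import random

def grad(x, y):
    # Gradient of f(x, y) = 0.5*x^2 - 0.5*y^2 + 0.25*y^4.
    return x, -y + y ** 3

def descend(x, y, lr, steps, noise_std, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Additive Gaussian perturbation of the gradient (zero when noise_std == 0).
        gx += rng.gauss(0.0, noise_std)
        gy += rng.gauss(0.0, noise_std)
        x -= lr * gx
        y -= lr * gy
    return x, y

# Started on the line y = 0, exact gradient descent never leaves it
# and converges to the saddle point.
plain = descend(1.0, 0.0, lr=0.1, steps=500, noise_std=0.0)

# With small perturbations the iterate drifts off the line and settles
# near one of the two local minima.
perturbed = descend(1.0, 0.0, lr=0.1, steps=500, noise_std=0.01)
```

Which of the two minima the perturbed run reaches depends on the random seed.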

algorithm, second-order guarantee, stationary point, (12 more...)

arXiv.org Machine Learning

2003.14366

Country:

Europe > Switzerland > Vaud > Lausanne (0.24)
Asia > Middle East > Jordan (0.04)
North America > Canada > Quebec > Montreal (0.04)
(5 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)

Information-Theoretic Lower Bounds for Zero-Order Stochastic Gradient Estimation

Alabdulkareem, Abdulrahman, Honorio, Jean

arXiv.org Machine Learning, Mar-30-2020

In this paper we analyze the necessary number of samples to estimate the gradient of any multidimensional smooth (possibly non-convex) function in a zero-order stochastic oracle model. In this model, an estimator has access to noisy values of the function, in order to produce the estimate of the gradient. We also provide an analysis on the sufficient number of samples for the finite difference method, a classical technique in numerical linear algebra. For $T$ samples and $d$ dimensions, our information-theoretic lower bound is $\Omega(\sqrt{d/T})$. We show that the finite difference method has rate $O(d^{4/3}/\sqrt{T})$ for functions with zero third and higher order derivatives. Thus, the finite difference method is not minimax optimal, and therefore there is space for the development of better gradient estimation methods.
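As a rough sketch of the finite difference estimator in a zero-order oracle model, the example below averages central differences of noisy function values. The test function (a quadratic, which has zero third and higher order derivatives), the noise level, the spacing h and the way samples are split evenly across coordinates are assumptions of this example.

```python
import random

def noisy_f(x, rng, noise_std=0.01):
    # Zero-order oracle: returns f(x) = sum(x_i^2) corrupted by Gaussian noise.
    return sum(xi * xi for xi in x) + rng.gauss(0.0, noise_std)

def fd_gradient(f, x, h, samples_per_coord, rng):
    # Central finite differences, averaged over repeated noisy queries.
    grad = []
    for i in range(len(x)):
        total = 0.0
        for _ in range(samples_per_coord):
            xp = list(x)
            xm = list(x)
            xp[i] += h
            xm[i] -= h
            total += (f(xp, rng) - f(xm, rng)) / (2.0 * h)
        grad.append(total / samples_per_coord)
    return grad

rng = random.Random(0)
x = [1.0, -2.0, 0.5]
true_grad = [2.0 * xi for xi in x]
estimate = fd_gradient(noisy_f, x, h=0.1, samples_per_coord=200, rng=rng)
```

Each coordinate costs two oracle queries per sample, so this run uses 2 * 200 * 3 function evaluations in total.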

estimation, estimator, oracle, (13 more...)

arXiv.org Machine Learning

2003.13881

Country:

North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
North America > United States > New York (0.04)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)

Adaptive Group Sparse Regularization for Continual Learning

Jung, Sangwon, Ahn, Hongjoon, Cha, Sungmin, Moon, Taesup

arXiv.org Machine Learning, Mar-30-2020

We propose a novel regularization-based continual learning method, dubbed Adaptive Group Sparsity based Continual Learning (AGS-CL), using two group sparsity-based penalties. Our method selectively employs the two penalties when learning each node based on its importance, which is adaptively updated after learning each new task. By utilizing the proximal gradient descent method for learning, the exact sparsity and freezing of the model is guaranteed, and thus, the learner can explicitly control the model capacity as the learning continues. Furthermore, as a critical detail, we re-initialize the weights associated with unimportant nodes after learning each task in order to prevent the negative transfer that causes catastrophic forgetting and to facilitate efficient learning of new tasks. Through extensive experimental results, we show that our AGS-CL uses much less additional memory space for storing the regularization parameters, and it significantly outperforms several state-of-the-art baselines on representative continual learning benchmarks for both supervised and reinforcement learning tasks.
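The reason proximal gradient descent yields exact sparsity can be seen in the standard group-lasso proximal operator (block soft-thresholding). The sketch below shows only that generic operator; it does not reproduce the two adaptive penalties of AGS-CL, and the function names and numbers are hypothetical.

```python
import math

def group_soft_threshold(v, thresh):
    # Proximal operator of thresh * ||v||_2: shrink the whole group toward
    # zero, and set it exactly to zero when its norm is below the threshold.
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= thresh:
        return [0.0] * len(v)
    scale = 1.0 - thresh / norm
    return [scale * x for x in v]

def proximal_step(groups, grads, lr, lam):
    # One proximal gradient step: a plain gradient step on the smooth loss,
    # followed by the group-wise proximal operator with threshold lr * lam.
    out = []
    for w, g in zip(groups, grads):
        moved = [wi - lr * gi for wi, gi in zip(w, g)]
        out.append(group_soft_threshold(moved, lr * lam))
    return out

# Two weight groups (for example, the incoming weights of two nodes).
groups = [[3.0, 4.0], [0.03, 0.04]]
grads = [[0.0, 0.0], [0.0, 0.0]]
new_groups = proximal_step(groups, grads, lr=0.1, lam=1.0)
```

Here the small-norm group is mapped to exactly zero, while the large-norm group is only shrunk slightly. A plain (sub)gradient step on the same penalty would typically leave small nonzero values instead.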

learning, neural network, node, (15 more...)

arXiv.org Machine Learning

2003.13726

Country: Asia > South Korea > Gyeonggi-do > Suwon (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Iterative Pre-Conditioning to Expedite the Gradient-Descent Method

Chakrabarti, Kushal, Gupta, Nirupam, Chopra, Nikhil

arXiv.org Machine Learning, Mar-29-2020

This paper considers the problem of multi-agent distributed optimization. In this problem, there are multiple agents in the system, and each agent only knows its local cost function. The objective for the agents is to collectively compute a common minimum of the aggregate of all their local cost functions. In principle, this problem is solvable using a distributed variant of the traditional gradient-descent method, which is an iterative method. However, the speed of convergence of the traditional gradient-descent method is highly influenced by the conditioning of the optimization problem being solved. Specifically, the method requires a large number of iterations to converge to a solution if the optimization problem is ill-conditioned. In this paper, we propose an iterative pre-conditioning approach that can significantly attenuate the influence of the problem's conditioning on the convergence-speed of the gradient-descent method. The proposed pre-conditioning approach can be easily implemented in distributed systems and has minimal computation and communication overhead. For now, we only consider a specific distributed optimization problem wherein the individual local cost functions of the agents are quadratic. Besides the theoretical guarantees, the improved convergence speed of our approach is demonstrated through experiments on a real data-set.
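To illustrate how conditioning affects gradient descent on a quadratic cost, and how an iteratively refined pre-conditioner can help, here is a simplified single-machine sketch. It is not the paper's distributed algorithm: the update K <- K - alpha * (A K - I), the 2x2 matrix with eigenvalues 1 and 100, and all parameter values are assumptions made for this example.

```python
def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def gradient(A, x, b):
    # Gradient of 0.5 * x^T A x - b^T x.
    return [ai - bi for ai, bi in zip(matvec(A, x), b)]

def norm(v):
    return (v[0] ** 2 + v[1] ** 2) ** 0.5

def plain_gd(A, b, step, tol=1e-6, max_iter=5000):
    x = [0.0, 0.0]
    for it in range(max_iter):
        g = gradient(A, x, b)
        if norm(g) < tol:
            return it
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return max_iter

def preconditioned_gd(A, b, alpha, tol=1e-6, max_iter=5000):
    x = [0.0, 0.0]
    K = [[0.0, 0.0], [0.0, 0.0]]  # pre-conditioner, refined every iteration
    for it in range(max_iter):
        g = gradient(A, x, b)
        if norm(g) < tol:
            return it
        AK = matmul(A, K)
        # Assumed refinement rule: one Richardson step toward the inverse of A.
        K = [[K[i][j] - alpha * (AK[i][j] - (1.0 if i == j else 0.0))
              for j in range(2)] for i in range(2)]
        Kg = matvec(K, g)
        x = [xi - kgi for xi, kgi in zip(x, Kg)]
    return max_iter

A = [[50.5, 49.5], [49.5, 50.5]]  # eigenvalues 100 and 1 (condition number 100)
b = [1.0, 0.0]
plain_iters = plain_gd(A, b, step=0.01)
precond_iters = preconditioned_gd(A, b, alpha=0.01)
```

In this toy setting the pre-conditioned iteration needs far fewer iterations than plain gradient descent, because the effective step along the poorly scaled direction grows as K approaches the inverse of A.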

algorithm 1, matrix, server, (15 more...)

arXiv.org Machine Learning

2003.0718

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > Texas > Brazos County > College Station (0.04)
North America > United States > District of Columbia > Washington (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.90)

Differentially Private Federated Learning for Resource-Constrained Internet of Things

Hu, Rui, Guo, Yuanxiong, Ratazzi, E. Paul, Gong, Yanmin

arXiv.org Machine Learning, Mar-28-2020

With the proliferation of smart devices having built-in sensors, Internet connectivity, and programmable computation capability in the era of the Internet of Things (IoT), a tremendous amount of data is being generated at the network edge. Federated learning is capable of analyzing the large amount of data from a distributed set of smart devices without requiring them to upload their data to a central place. However, the commonly used federated learning algorithm is based on stochastic gradient descent (SGD) and is not suitable for resource-constrained IoT environments due to its high communication resource requirement. Moreover, the privacy of sensitive data on smart devices has become a key concern and needs to be protected rigorously. This paper proposes a novel federated learning framework called DP-PASGD for training a machine learning model efficiently from the data stored across resource-constrained smart devices in IoT while guaranteeing differential privacy. The optimal schematic design of DP-PASGD that maximizes the learning performance while satisfying the limits on resource cost and privacy loss is formulated as an optimization problem, and an approximate solution method based on the convergence analysis of DP-PASGD is developed to solve the optimization problem efficiently. Numerical results based on real-world datasets verify the effectiveness of the proposed DP-PASGD scheme.
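The general pattern of combining local SGD steps, periodic averaging and noise injection can be sketched as follows. This is not the DP-PASGD algorithm itself: the scalar model, the clipping rule, the noise level and every function name are assumptions of this example, and the noise is not calibrated to any particular privacy budget.

```python
import random

def clip(g, bound):
    # Clip a scalar gradient to [-bound, bound] to limit its sensitivity.
    return max(-bound, min(bound, g))

def local_update(w, data, lr, local_steps, clip_bound, noise_std, rng):
    # Several local SGD steps on one device before any communication.
    for _ in range(local_steps):
        x = rng.choice(data)
        g = clip(w - x, clip_bound)      # gradient of 0.5 * (w - x)^2, clipped
        g += rng.gauss(0.0, noise_std)   # Gaussian noise added on the device
        w -= lr * g
    return w

def federated_round(w, device_data, lr, local_steps, clip_bound, noise_std, rng):
    # Each device starts from the shared model; the server averages the results.
    local_models = [
        local_update(w, data, lr, local_steps, clip_bound, noise_std, rng)
        for data in device_data
    ]
    return sum(local_models) / len(local_models)

rng = random.Random(0)
devices = [[1.0, 1.2, 0.8], [2.0, 2.1, 1.9], [3.0, 2.9, 3.1]]
w = 0.0
for _ in range(200):
    w = federated_round(w, devices, lr=0.1, local_steps=5,
                        clip_bound=1.0, noise_std=0.1, rng=rng)
```

Raising local_steps reduces how often devices communicate, which is the resource trade-off the abstract refers to; choosing it jointly with the noise level is the design problem the paper formulates.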

accuracy, dp-pasgd, privacy, (17 more...)

arXiv.org Machine Learning

2003.12705

Country:

North America > United States > Texas > Bexar County > San Antonio (0.04)
North America > United States > California (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology > Smart Houses & Appliances (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Optimized Directed Roadmap Graph for Multi-Agent Path Finding Using Stochastic Gradient Descent

Henkel, Christian, Toussaint, Marc

arXiv.org Artificial Intelligence, Mar-28-2020

We present a novel approach called Optimized Directed Roadmap Graph (ODRM). It is a method to build a directed roadmap graph that allows for collision avoidance in multi-robot navigation. This is a highly relevant problem, for example for industrial autonomous guided vehicles. The core idea of ODRM is that a directed roadmap can encode inherent properties of the environment which are useful when agents have to avoid each other in that same environment. Like Probabilistic Roadmaps (PRMs), ODRM's first step is generating samples from C-space. In a second step, ODRM optimizes vertex positions and edge directions by Stochastic Gradient Descent (SGD). This leads to emergent properties like edges parallel to walls and patterns similar to two-lane streets or roundabouts. Agents can then navigate on this graph by searching their path independently and solving occurring agent-agent collisions at run-time. With the graphs generated by ODRM, significantly fewer agent-agent collisions happen than with a non-optimized graph. We evaluate our roadmap with both centralized and decentralized planners. Our experiments show that with ODRM even a simple centralized planner can solve problems with high numbers of agents that other multi-agent planners cannot solve. Additionally, we use simulated robots with decentralized planners and online collision avoidance to show how agents are a lot faster on our roadmap than on standard grid maps.

agent, graph, roadmap, (14 more...)

arXiv.org Artificial Intelligence

2003.12924

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > Czechia > South Moravian Region > Brno (0.05)
Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.05)
(3 more...)

Genre: Research Report (1.00)

Industry: Transportation (0.95)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

A Hybrid-Order Distributed SGD Method for Non-Convex Optimization to Balance Communication Overhead, Computational Complexity, and Convergence Rate

Omidvar, Naeimeh, Maddah-Ali, Mohammad Ali, Mahdavi, Hamed

arXiv.org Machine Learning, Mar-27-2020

In this paper, we propose a method of distributed stochastic gradient descent (SGD) with low communication load and computational complexity, yet fast convergence. To reduce the communication load, at each iteration of the algorithm, the worker nodes calculate and communicate some scalars, namely the directional derivatives of the sample functions in some pre-shared directions. However, to maintain accuracy, after every specific number of iterations, they communicate the vectors of stochastic gradients. To reduce the computational complexity in each iteration, the worker nodes approximate the directional derivatives with zeroth-order stochastic gradient estimation, by performing just two function evaluations rather than computing a first-order gradient vector. The proposed method greatly improves the convergence rate of the zeroth-order methods, guaranteeing order-wise faster convergence. Moreover, compared to the famous communication-efficient methods of model averaging (that perform local model updates and periodic communication of the gradients to synchronize the local models), we prove that for the general class of non-convex stochastic problems and with reasonable choice of parameters, the proposed method guarantees the same orders of communication load and convergence rate, while having order-wise less computational complexity. Experimental results on various learning problems in neural network applications demonstrate the effectiveness of the proposed approach compared to various state-of-the-art distributed SGD methods.
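The scalar-communication idea can be sketched as follows. Each worker estimates a directional derivative from two function evaluations and sends a single number; every node regenerates the same direction from a shared seed. The quadratic worker losses, the seed-per-step convention and all parameter values are assumptions of this example, and the periodic exchange of full gradient vectors described in the abstract is omitted.

```python
import random

def shared_direction(dim, seed):
    # Every node can regenerate the same direction from the shared seed,
    # so the direction itself never has to be transmitted.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def directional_scalar(f, x, v, mu=1e-4):
    # Zeroth-order estimate of the directional derivative along v,
    # using two function evaluations.
    x_shifted = [xi + mu * vi for xi, vi in zip(x, v)]
    return (f(x_shifted) - f(x)) / mu

def make_loss(center):
    # Illustrative local loss of one worker: 0.5 * ||x - center||^2.
    return lambda x: 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, center))

worker_losses = [make_loss([1.0, 2.0]), make_loss([3.0, 0.0])]
x = [0.0, 0.0]
lr = 0.05
for step in range(3000):
    v = shared_direction(len(x), seed=step)
    # Each worker communicates one scalar per iteration.
    scalars = [directional_scalar(f, x, v) for f in worker_losses]
    avg = sum(scalars) / len(scalars)
    x = [xi - lr * avg * vi for xi, vi in zip(x, v)]
```

The average of the two local losses is minimized at (2, 1), and the iterate approaches that point even though no gradient vector is ever computed or sent.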

algorithm, balance communication overhead, iteration, (9 more...)

arXiv.org Machine Learning

2003.12423

Country:

South America > Paraguay > Asunción > Asunción (0.04)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)

Genre: Research Report (0.82)

Industry:

Education (0.48)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Negative Margin Matters: Understanding Margin in Few-shot Classification

Liu, Bin, Cao, Yue, Lin, Yutong, Li, Qi, Zhang, Zheng, Long, Mingsheng, Hu, Han

arXiv.org Machine Learning, Mar-26-2020

This paper introduces a negative margin loss to metric-learning-based few-shot learning methods. The negative margin loss significantly outperforms regular softmax loss, and achieves state-of-the-art accuracy on three standard few-shot classification benchmarks with few bells and whistles. These results are contrary to the common practice in the metric learning field, that the margin is zero or positive. To understand why the negative margin loss performs well for few-shot classification, we analyze the discriminability of learned features w.r.t. different margins for training and novel classes, both empirically and theoretically. We find that although negative margin reduces the feature discriminability for training classes, it may also avoid falsely mapping samples of the same novel class to multiple peaks or clusters, and thus benefit the discrimination of novel classes.
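A minimal sketch of a margin softmax loss, assuming the common additive form in which the margin is subtracted from the true-class logit before the softmax (the exact similarity function and scaling used in the paper may differ; the `scale` parameter and the sample logits are hypothetical):

```python
import math

def margin_softmax_loss(logits, label, margin, scale=1.0):
    # Subtract the margin from the true-class logit only, then apply
    # the usual cross-entropy. margin = 0 recovers the regular softmax loss.
    adjusted = [scale * (z - margin if j == label else z)
                for j, z in enumerate(logits)]
    top = max(adjusted)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(a - top) for a in adjusted))
    return log_sum - adjusted[label]

logits = [2.0, 1.0, 0.1]
loss_negative = margin_softmax_loss(logits, label=0, margin=-0.3)
loss_zero = margin_softmax_loss(logits, label=0, margin=0.0)
loss_positive = margin_softmax_loss(logits, label=0, margin=0.3)
```

For the same logits, a negative margin lowers the loss and a positive margin raises it, so a negative margin enforces a weaker separation between training classes.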

base class, novel class, softmax loss, (16 more...)

arXiv.org Machine Learning

2003.1206

Country:

North America > United States > California (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)
