AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

The cyclic job-shop scheduling problem: The new subclass of the job-shop problem and applying the Simulated annealing to solve it

arXiv.org Artificial IntelligenceJun-18-2020

In the paper, the new approach to the scheduling problem are described. The approach deals with the problem of planning the cyclic production and proposes to consider such scheduling problem as the cyclic job-shop problem of the order k, where k is the number of reiterations. It was found out that planning of only one iteration of the loop is less effective than planning of the entire cycle. To the experimental research, a number of test instances of the job-shop scheduling problem by Operation Research Library were used. The Simulated Annealing was applied to solve the instances. The experiments proved that the approach proposed allows increasing the efficiency of cyclic scheduling significantly.

artificial intelligence, machine learning, scheduling problem, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICIEAM.2016.7911676

2006.10938

Country:

Asia > Russia > Siberian Federal District > Novosibirsk Oblast > Novosibirsk (0.05)
North America > United States > Massachusetts > Middlesex County > Reading (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
(2 more...)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

Gradient Descent in RKHS with Importance Labeling

Murata, Tomoya, Suzuki, Taiji

arXiv.org Machine LearningJun-18-2020

Labeling cost is often expensive and is a fundamental limitation of supervised learning. In this paper, we study importance labeling problem, in which we are given many unlabeled data and select a limited number of data to be labeled from the unlabeled data, and then a learning algorithm is executed on the selected one. We propose a new importance labeling scheme and analyse the generalization error of gradient descent combined with our labeling scheme in least squares regression in Reproducing Kernel Hilbert Spaces (RKHS). We show that the proposed importance labeling leads to much better generalization ability than uniform one under near interpolation settings. Numerical experiments verify our theoretical findings.

artificial intelligence, learning, machine learning, (15 more...)

arXiv.org Machine Learning

2006.10925

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Wisconsin > Dane County > Madison (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)

Add feedback

Stochastic Gradient Descent in Hilbert Scales: Smoothness, Preconditioning and Earlier Stopping

Mücke, Nicole, Reiss, Enrico

arXiv.org Machine LearningJun-18-2020

When solving nonparametric least-squares problems in an RKHS we face the problem that the unknown solution may not have the expected smoothness (regularity) implied by the kernel. Then the question arises whether the use of such mis-specified kernels still allows for good reconstructions yielding errors of optimal order. Although it is a commonly accepted fact that the regularity inherent in the solution has an impact on accuracy and convergence of learning algorithms, there are only poor precise mathematical investigations in the framework of learning in RKHSs using SGD. Mathematically, smoothness can be expressed in various different ways. Classically, the concept of source conditions proved to be useful, expressing the target function as element of the domain of a differential operator, see e.g.

artificial intelligence, machine learning, smoothness, (15 more...)

arXiv.org Machine Learning

2006.1084

Country: Europe > Germany > Brandenburg > Potsdam (0.04)

Genre: Research Report (0.82)

Industry: Education (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.87)

Add feedback

Improving the Convergence Rate of One-Point Zeroth-Order Optimization using Residual Feedback

Zhang, Yan, Zhou, Yi, Ji, Kaiyi, Zavlanos, Michael M.

arXiv.org Machine LearningJun-18-2020

Many existing zeroth-order optimization (ZO) algorithms adopt two-point feedback schemes due to their fast convergence rate compared to one-point feedback schemes. However, two-point schemes require two evaluations of the objective function at each iteration, which can be impractical in applications where the data are not all available a priori, e.g., in online optimization. In this paper, we propose a novel one-point feedback scheme that queries the function value only once at each iteration and estimates the gradient using the residual between two consecutive feedback points. When optimizing a deterministic Lipschitz function, we show that the query complexity of ZO with the proposed one-point residual feedback matches that of ZO with the existing two-point feedback schemes. Moreover, the query complexity of the proposed algorithm can be improved when the objective function has Lipschitz gradient. Then, for stochastic bandit optimization problems, we show that ZO with one-point residual feedback achieves the same convergence rate as that of ZO with two-point feedback with uncontrollable data samples. We demonstrate the effectiveness of the proposed one-point residual feedback via extensive numerical experiments.

artificial intelligence, inequality, machine learning, (17 more...)

arXiv.org Machine Learning

2006.1082

Country:

North America > United States > North Carolina > Durham County > Durham (0.04)
North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
North America > United States > Ohio > Franklin County > Columbus (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Add feedback

SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation

Gower, Robert M., Sebbouh, Othmane, Loizou, Nicolas

arXiv.org Machine LearningJun-18-2020

We provide several convergence theorems for SGD for two large classes of structured non-convex functions: (i) the Quasar (Strongly) Convex functions and (ii) the functions satisfying the Polyak-Lojasiewicz condition. Our analysis relies on the Expected Residual condition which we show is a strictly weaker assumption as compared to previously used growth conditions, expected smoothness or bounded variance assumptions. We provide theoretical guarantees for the convergence of SGD for different step size selections including constant, decreasing and the recently proposed stochastic Polyak step size. In addition, all of our analysis holds for the arbitrary sampling paradigm, and as such, we are able to give insights into the complexity of minibatching and determine an optimal minibatch size. In particular we recover the best known convergence rates of full gradient descent and single element sampling SGD as a special case. Finally, we show that for models that interpolate the training data, we can dispense of our Expected Residual condition and give state-of-the-art results in this setting.

artificial intelligence, assumption, machine learning, (15 more...)

arXiv.org Machine Learning

2006.10311

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Add feedback

Infinite attention: NNGP and NTK for deep attention networks

Hron, Jiri, Bahri, Yasaman, Sohl-Dickstein, Jascha, Novak, Roman

arXiv.org Machine LearningJun-18-2020

There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved results for both finite and infinitely wide NNs. We evaluate attention kernels empirically, leading to a moderate improvement upon the previous state-of-the-art on CIFAR-10 for GPs without trainable kernels and advanced data preprocessing. Finally, we introduce new features to the Neural Tangents library (Novak et al., 2020) allowing applications of NNGP/NTK models, with and without attention, to variable-length sequences, with an example on the IMDb reviews dataset.

converge, machine learning, natural language, (20 more...)

arXiv.org Machine Learning

2006.1054

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(4 more...)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Neural Architecture Optimization with Graph VAE

Li, Jian, Liu, Yong, Liu, Jiankun, Wang, Weiping

arXiv.org Machine LearningJun-18-2020

Due to their high computational efficiency on a continuous space, gradient optimization methods have shown great potential in the neural architecture search (NAS) domain. The mapping of network representation from the discrete space to a latent space is the key to discovering novel architectures, however, existing gradient-based methods fail to fully characterize the networks. In this paper, we propose an efficient NAS approach to optimize network architectures in a continuous space, where the latent space is built upon variational autoencoder (VAE) and graph neural networks (GNN). The framework jointly learns four components: the encoder, the performance predictor, the complexity predictor and the decoder in an end-to-end manner. The encoder and the decoder belong to a graph VAE, mapping architectures between continuous representations and network architectures. The predictors are two regression models, fitting the performance and computational complexity, respectively. Those predictors ensure the discovered architectures characterize both excellent performance and high computational efficiency. Extensive experiments demonstrate our framework not only generates appropriate continuous representations but also discovers powerful neural architectures.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Machine Learning

2006.1031

Country: North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback

A block coordinate descent optimizer for classification problems exploiting convexity

Patel, Ravi G., Trask, Nathaniel A., Gulian, Mamikon A., Cyr, Eric C.

arXiv.org Machine LearningJun-17-2020

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number weights in the linear layer and not the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architecture. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of this approach to classification problems. We first prove that the resulting Hessian matrix is symmetric semi-definite, and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting possible gains of second-order methods over gradient descent.

accuracy, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

2006.10123

Country:

North America > United States (1.00)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)

Genre: Research Report > New Finding (0.66)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy (0.71)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.77)

Add feedback

Shape Matters: Understanding the Implicit Bias of the Noise Covariance

HaoChen, Jeff Z., Wei, Colin, Lee, Jason D., Ma, Tengyu

arXiv.org Machine LearningJun-17-2020

The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate the phenomenon that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et el. and Woodworth et el. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground-truth with an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not. Code for our project is publicly available.

artificial intelligence, machine learning, noise, (15 more...)

arXiv.org Machine Learning

2006.0868

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

Add feedback

Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms

Bresler, Guy, Jain, Prateek, Nagaraj, Dheeraj, Netrapalli, Praneeth, Wu, Xian

arXiv.org Machine LearningJun-16-2020

We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain. We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$, the mixing time of the underlying Markov chain, under different noise settings. Our results establish that in general, optimization with Markovian data is strictly harder than optimization with independent data and a trivial algorithm (SGD-DD) that works with only one in every $\tilde{\Theta}(\tau_{\mathsf{mix}})$ samples, which are approximately independent, is minimax optimal. In fact, it is strictly better than the popular Stochastic Gradient Descent (SGD) method with constant step-size which is otherwise minimax optimal in the regression with independent data setting. Beyond a worst case analysis, we investigate whether structured datasets seen in practice such as Gaussian auto-regressive dynamics can admit more efficient optimization schemes. Surprisingly, even in this specific and natural setting, Stochastic Gradient Descent (SGD) with constant step-size is still no better than SGD-DD. Instead, we propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate. Our improved rate serves as one of the first results where an algorithm outperforms SGD-DD on an interesting Markov chain and also provides one of the first theoretical analyses to support the use of experience replay in practice.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

2006.08916

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > India > Karnataka > Bengaluru (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.76)

Add feedback