 Gradient Descent


The cyclic job-shop scheduling problem: The new subclass of the job-shop problem and applying the Simulated annealing to solve it

arXiv.org Artificial Intelligence

This paper describes a new approach to the scheduling problem. The approach addresses the planning of cyclic production and treats it as a cyclic job-shop problem of order k, where k is the number of repetitions of the cycle. Planning only one iteration of the loop was found to be less effective than planning the entire cycle. For the experiments, a number of job-shop scheduling test instances from the Operations Research Library were used, and simulated annealing was applied to solve them. The experiments showed that the proposed approach significantly increases the efficiency of cyclic scheduling.


Gradient Descent in RKHS with Importance Labeling

arXiv.org Machine Learning

Labeling is often expensive, which is a fundamental limitation of supervised learning. In this paper, we study the importance labeling problem, in which we are given many unlabeled data points, select a limited number of them to be labeled, and then run a learning algorithm on the selected points. We propose a new importance labeling scheme and analyse the generalization error of gradient descent combined with our labeling scheme in least squares regression in Reproducing Kernel Hilbert Spaces (RKHS). We show that the proposed importance labeling leads to much better generalization ability than uniform labeling under near-interpolation settings. Numerical experiments verify our theoretical findings.


Stochastic Gradient Descent in Hilbert Scales: Smoothness, Preconditioning and Earlier Stopping

arXiv.org Machine Learning

When solving nonparametric least-squares problems in an RKHS, we face the problem that the unknown solution may not have the expected smoothness (regularity) implied by the kernel. The question then arises whether the use of such mis-specified kernels still allows for good reconstructions yielding errors of optimal order. Although it is commonly accepted that the regularity inherent in the solution has an impact on the accuracy and convergence of learning algorithms, there are only few precise mathematical investigations in the framework of learning in RKHSs using SGD. Mathematically, smoothness can be expressed in various ways. Classically, the concept of source conditions has proved useful, expressing the target function as an element of the domain of a differential operator, see e.g.


Improving the Convergence Rate of One-Point Zeroth-Order Optimization using Residual Feedback

arXiv.org Machine Learning

Many existing zeroth-order optimization (ZO) algorithms adopt two-point feedback schemes due to their fast convergence rate compared to one-point feedback schemes. However, two-point schemes require two evaluations of the objective function at each iteration, which can be impractical in applications where the data are not all available a priori, e.g., in online optimization. In this paper, we propose a novel one-point feedback scheme that queries the function value only once at each iteration and estimates the gradient using the residual between two consecutive feedback points. When optimizing a deterministic Lipschitz function, we show that the query complexity of ZO with the proposed one-point residual feedback matches that of ZO with the existing two-point feedback schemes. Moreover, the query complexity of the proposed algorithm can be improved when the objective function has Lipschitz gradient. Then, for stochastic bandit optimization problems, we show that ZO with one-point residual feedback achieves the same convergence rate as that of ZO with two-point feedback with uncontrollable data samples. We demonstrate the effectiveness of the proposed one-point residual feedback via extensive numerical experiments.
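
The one-point residual feedback idea in this abstract can be illustrated with a minimal Python sketch. It assumes the gradient estimate takes the form (d/delta) * (f(x_t + delta*u_t) - f(x_{t-1} + delta*u_{t-1})) * u_t, with u_t drawn uniformly from the unit sphere. The abstract does not state this formula, so it is an assumption here, as are the function names, the step size, and the smoothing radius delta.

```python
import math
import random


def unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere in R^dim."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]


def residual_feedback_zo(f, x0, steps, step_size, delta, seed=0):
    """Zeroth-order descent that queries f once per iteration.

    The gradient estimate reuses the function value observed at the
    previous iteration instead of issuing a second query (assumed form).
    """
    rng = random.Random(seed)
    dim = len(x0)
    x = list(x0)
    # One extra query initialises the "previous" feedback value.
    u = unit_vector(dim, rng)
    prev_value = f([xi + delta * ui for xi, ui in zip(x, u)])
    for _ in range(steps):
        u = unit_vector(dim, rng)
        # The only function evaluation in this iteration.
        value = f([xi + delta * ui for xi, ui in zip(x, u)])
        # Residual between two consecutive feedback values.
        scale = (dim / delta) * (value - prev_value)
        x = [xi - step_size * scale * ui for xi, ui in zip(x, u)]
        prev_value = value
    return x


def sphere(x):
    """Simple smooth test objective: squared Euclidean norm."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    start = [2.0] * 5
    end = residual_feedback_zo(sphere, start, steps=2000, step_size=1e-3, delta=0.1)
    print(sphere(start), sphere(end))
```

The step size is kept small relative to delta because each estimate depends on the previous iterate; a large step would make the residual, and hence the variance of the estimate, grow.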


SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation

arXiv.org Machine Learning

We provide several convergence theorems for SGD for two large classes of structured non-convex functions: (i) the Quasar (Strongly) Convex functions and (ii) the functions satisfying the Polyak-Lojasiewicz condition. Our analysis relies on the Expected Residual condition, which we show is a strictly weaker assumption than the previously used growth conditions, expected smoothness or bounded variance assumptions. We provide theoretical guarantees for the convergence of SGD for different step size selections, including constant, decreasing and the recently proposed stochastic Polyak step size. In addition, all of our analysis holds for the arbitrary sampling paradigm, and as such, we are able to give insights into the complexity of minibatching and determine an optimal minibatch size. In particular, we recover the best known convergence rates of full gradient descent and single element sampling SGD as special cases. Finally, we show that for models that interpolate the training data, we can dispense with our Expected Residual condition and give state-of-the-art results in this setting.
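
As an illustration of the stochastic Polyak step size mentioned in this abstract, here is a minimal Python sketch on an interpolating least-squares problem. It assumes the commonly cited form step = (f_i(w) - f_i^*) / (c * ||grad f_i(w)||^2), with f_i^* = 0 because every equation can be fitted exactly. The function name, the constant c = 0.5, and the test problem are assumptions for illustration and are not taken from the paper.

```python
import random


def sgd_polyak(rows, targets, steps, c=0.5, seed=0):
    """SGD with a stochastic Polyak step size on f_i(w) = 0.5*(a_i.w - b_i)^2.

    Assumes interpolation, i.e. the minimum of each f_i is 0.
    """
    rng = random.Random(seed)
    dim = len(rows[0])
    w = [0.0] * dim
    for _ in range(steps):
        i = rng.randrange(len(rows))  # single-element sampling
        a, b = rows[i], targets[i]
        residual = sum(aj * wj for aj, wj in zip(a, w)) - b
        loss = 0.5 * residual * residual
        grad = [residual * aj for aj in a]
        grad_sq = sum(g * g for g in grad)
        if grad_sq == 0.0:
            continue  # this sample is already fitted exactly
        step = loss / (c * grad_sq)  # Polyak step with f_i^* = 0
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w


def make_problem(n=20, seed=1):
    """Build a consistent linear system so that interpolation holds."""
    rng = random.Random(seed)
    w_star = [1.0, -2.0, 0.5]
    rows = [[rng.gauss(0.0, 1.0) for _ in w_star] for _ in range(n)]
    targets = [sum(aj * wj for aj, wj in zip(a, w_star)) for a in rows]
    return rows, targets, w_star


if __name__ == "__main__":
    rows, targets, w_star = make_problem()
    print(sgd_polyak(rows, targets, steps=500), w_star)
```

With c = 0.5 the step reduces to 1/||a_i||^2, so each update projects w onto the solution set of the sampled equation; no learning rate has to be tuned by hand in this setting.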


Infinite attention: NNGP and NTK for deep attention networks

arXiv.org Machine Learning

There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved results for both finite and infinitely wide NNs. We evaluate attention kernels empirically, leading to a moderate improvement upon the previous state-of-the-art on CIFAR-10 for GPs without trainable kernels and advanced data preprocessing. Finally, we introduce new features to the Neural Tangents library (Novak et al., 2020) allowing applications of NNGP/NTK models, with and without attention, to variable-length sequences, with an example on the IMDb reviews dataset.


Neural Architecture Optimization with Graph VAE

arXiv.org Machine Learning

Due to their high computational efficiency on a continuous space, gradient optimization methods have shown great potential in the neural architecture search (NAS) domain. The mapping of network representations from the discrete space to a latent space is the key to discovering novel architectures; however, existing gradient-based methods fail to fully characterize the networks. In this paper, we propose an efficient NAS approach to optimize network architectures in a continuous space, where the latent space is built upon a variational autoencoder (VAE) and graph neural networks (GNN). The framework jointly learns four components in an end-to-end manner: the encoder, the performance predictor, the complexity predictor and the decoder. The encoder and the decoder belong to a graph VAE, mapping between network architectures and their continuous representations. The predictors are two regression models, fitting the performance and the computational complexity, respectively. These predictors ensure that the discovered architectures combine excellent performance with high computational efficiency. Extensive experiments demonstrate that our framework not only generates appropriate continuous representations but also discovers powerful neural architectures.


A block coordinate descent optimizer for classification problems exploiting convexity

arXiv.org Machine Learning

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer and not with the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architectures. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of this approach to classification problems. We first prove that the resulting Hessian matrix is symmetric semi-definite, and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting possible gains of second-order methods over gradient descent.


Shape Matters: Understanding the Implicit Bias of the Noise Covariance

arXiv.org Machine Learning

The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect for training overparameterized models. Prior theoretical work largely focuses on spherical Gaussian noise, whereas empirical studies demonstrate the phenomenon that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise. This paper theoretically characterizes this phenomenon on a quadratically-parameterized model introduced by Vaskevicius et al. and Woodworth et al. We show that in an over-parameterized setting, SGD with label noise recovers the sparse ground truth from an arbitrary initialization, whereas SGD with Gaussian noise or gradient descent overfits to dense solutions with large norms. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not. Code for our project is publicly available.


Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms

arXiv.org Machine Learning

We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain. We establish sharp information theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$, the mixing time of the underlying Markov chain, under different noise settings. Our results establish that in general, optimization with Markovian data is strictly harder than optimization with independent data and a trivial algorithm (SGD-DD) that works with only one in every $\tilde{\Theta}(\tau_{\mathsf{mix}})$ samples, which are approximately independent, is minimax optimal. In fact, it is strictly better than the popular Stochastic Gradient Descent (SGD) method with constant step-size which is otherwise minimax optimal in the regression with independent data setting. Beyond a worst case analysis, we investigate whether structured datasets seen in practice such as Gaussian auto-regressive dynamics can admit more efficient optimization schemes. Surprisingly, even in this specific and natural setting, Stochastic Gradient Descent (SGD) with constant step-size is still no better than SGD-DD. Instead, we propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate. Our improved rate serves as one of the first results where an algorithm outperforms SGD-DD on an interesting Markov chain and also provides one of the first theoretical analyses to support the use of experience replay in practice.
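
The data-dropping baseline described in this abstract (SGD-DD) can be sketched in a few lines of Python. The sketch assumes a scalar, noise-free regression whose covariates follow a Gaussian AR(1) chain, and it uses a hypothetical `gap` parameter as a stand-in for the $\tilde{\Theta}(\tau_{\mathsf{mix}})$ spacing; the function names and all constants are assumptions and are not taken from the paper.

```python
import math
import random


def ar1_regression_stream(n, rho, w_true, seed=0):
    """Generate (x, y) pairs whose covariates follow a stationary AR(1) chain.

    Labels are noise-free (y = w_true * x) to keep the illustration simple.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    samples = []
    for _ in range(n):
        samples.append((x, w_true * x))
        # Next state depends on the current one, so samples are correlated.
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return samples


def sgd_data_drop(samples, step_size, gap):
    """Constant-step SGD that uses only every gap-th sample of the stream."""
    w = 0.0
    updates = 0
    for t, (x, y) in enumerate(samples):
        if t % gap != 0:
            continue  # drop samples that are too correlated with the last used one
        w -= step_size * (w * x - y) * x  # gradient of 0.5 * (w*x - y)^2
        updates += 1
    return w, updates


if __name__ == "__main__":
    stream = ar1_regression_stream(n=2000, rho=0.9, w_true=2.0)
    print(sgd_data_drop(stream, step_size=0.1, gap=10))
```

Setting `gap=1` recovers plain constant-step SGD on the full correlated stream, which makes the two procedures easy to compare on the same data.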