 Gradient Descent


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The paper proposes to use random functions as an approximation to kernel functions and then proposes to do stochastic gradient descent. Convergence rates and generalisation bounds are derived. Experimental results on large datasets are presented. The idea of introducing random functions to approximate kernel functions and then using SGD is very interesting.


Scalable Kernel Methods via Doubly Stochastic Gradients

Neural Information Processing Systems

The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves the optimization problems by making two unbiased stochastic approximations to the functional gradient (one using random training points and another using random features associated with the kernel) and performing descent steps with this noisy functional gradient. Our algorithm is simple, does not need to commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting. We demonstrate that a function learned by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space at a rate of O(1/t), and achieves a generalization bound of O(1/√t). Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show competitive performance of our approach as compared to neural nets on datasets such as 2.3 million energy materials from MolecularSpace, 8 million handwritten digits from MNIST, and 1 million photos from ImageNet using convolution features.
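The two sources of randomness described in this abstract can be illustrated with a small sketch. It is a toy under stated assumptions, not the authors' implementation: the Gaussian kernel with random Fourier features, the squared loss, the step-size schedule `step / (t + 10)`, the regularization constant `reg`, and all function names are choices made here for illustration. Each random feature is regenerated from its integer seed instead of being stored, so only one coefficient per iteration is kept.

```python
import math
import random

def random_feature(seed, x, sigma=1.0):
    """Random Fourier feature of a 1-D Gaussian kernel, regenerated from its seed."""
    rng = random.Random(seed)
    omega = rng.gauss(0.0, 1.0 / sigma)        # random frequency
    phase = rng.uniform(0.0, 2.0 * math.pi)    # random phase
    return math.sqrt(2.0) * math.cos(omega * x + phase)

def predict(alphas, x):
    """Evaluate f(x) = sum_i alpha_i * phi_i(x); feature i uses seed i."""
    return sum(a * random_feature(i, x) for i, a in enumerate(alphas))

def doubly_sgd(data, n_iters=200, step=1.0, reg=1e-3, seed=0):
    """Toy doubly stochastic functional gradient descent for squared loss."""
    rng = random.Random(seed)
    alphas = []
    for t in range(1, n_iters + 1):
        x, y = rng.choice(data)                 # randomness 1: a random training point
        gamma = step / (t + 10)                 # assumed decaying step size
        residual = predict(alphas, x) - y       # derivative of 0.5 * (f(x) - y)^2
        # Shrink old coefficients (regularization), then add one new coefficient
        # for a fresh random feature (randomness 2), identified by seed t - 1.
        alphas = [(1.0 - gamma * reg) * a for a in alphas]
        alphas.append(-gamma * residual * random_feature(t - 1, x))
    return alphas

# Hypothetical toy data: samples of sin(x) on [-3, 3].
data = [(-3.0 + 0.1 * k, math.sin(-3.0 + 0.1 * k)) for k in range(61)]
alphas = doubly_sgd(data, n_iters=200)
print(len(alphas), predict(alphas, 0.5))
```

The number of coefficients grows by one per iteration, which reflects the point that the function class is not fixed in advance. The price in this naive version is that each prediction costs time linear in the number of iterations so far.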



Learning to Learn By Self-Critique

Neural Information Processing Systems

In few-shot learning, a machine learning system learns from a small set of labelled examples relating to a specific task, such that it can generalize to new examples of the same task. Given the limited availability of labelled examples in such tasks, we wish to make use of all the information we can. Usually a model learns task-specific information from a small training set (support-set) to predict on an unlabelled validation set (target-set). The target-set contains additional task-specific information which is not utilized by existing few-shot learning methods. Making use of the target-set examples via transductive learning requires approaches beyond the current methods; at inference time, the target-set contains only unlabelled input data-points, and so discriminative learning cannot be used. In this paper, we propose a framework called Self-Critique and Adapt, or SCA, which learns to learn a label-free loss function, parameterized as a neural network. A base-model learns on a support-set using existing methods (e.g.


The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies

Neural Information Processing Systems

We study the relationship between the frequency of a function and the speed at which a neural network learns it. We build on recent results that show that the dynamics of overparameterized neural networks trained with gradient descent can be well approximated by a linear system. When normalized training data is uniformly distributed on a hypersphere, the eigenfunctions of this linear system are spherical harmonic functions.


Probabilistic low-rank matrix completion on finite alphabets

Neural Information Processing Systems

The task of reconstructing a matrix given a sample of observed entries is known as the matrix completion problem. It arises in a wide range of problems, including recommender systems, collaborative filtering, dimensionality reduction, image processing, quantum physics and multi-class classification, to name a few. Most works have focused on recovering an unknown real-valued low-rank matrix from a random sub-sample of its entries. Here, we investigate the case where the observations take a finite number of values, corresponding for example to ratings in recommender systems or labels in multi-class classification. We also consider a general sampling scheme (not necessarily uniform) over the matrix entries. The performance of a nuclear-norm penalized estimator is analyzed theoretically. More precisely, we derive bounds for the Kullback-Leibler divergence between the true and estimated distributions. In practice, we have also proposed an efficient algorithm based on lifted coordinate gradient descent in order to tackle potentially high-dimensional settings.


On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms
Lam M. Nguyen

Neural Information Processing Systems

The stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD, which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings.
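As an illustration of what "shuffling" means here, the sketch below draws a fresh permutation of the data at the start of each epoch and visits every sample exactly once, instead of sampling with replacement. The model (a one-dimensional linear fit), the data, the learning rate and the function name are assumptions chosen for brevity; the problem is convex and exactly fittable, so it does not reproduce the non-convex setting analyzed in the paper.

```python
import random

def shuffling_sgd(xs, ys, epochs=500, lr=0.05, seed=0):
    """Fit y ~ w * x + b with SGD that reshuffles the data every epoch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)              # new permutation for this epoch
        for i in order:                 # each sample is used exactly once per epoch
            err = w * xs[i] + b - ys[i]
            w -= lr * err * xs[i]       # gradient of 0.5 * err^2 with respect to w
            b -= lr * err               # gradient of 0.5 * err^2 with respect to b
    return w, b

# Hypothetical data generated by y = 2x + 1, which the model can fit exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = shuffling_sgd(xs, ys)
print(w, b)
```

Because the data can be fit exactly, every per-sample gradient vanishes at the solution, which is the simplest analogue of the over-parameterized regime mentioned in the abstract.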


Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory

Neural Information Processing Systems

Cohen et al. (2021) empirically study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent (GD) trajectory and observe the Edge of Stability (EoS) phenomenon.
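For readers unfamiliar with the term, sharpness can be estimated by power iteration on the Hessian. The sketch below is an illustration under assumptions: a fixed 2x2 symmetric matrix stands in for the loss Hessian, and the function name is hypothetical. For a quadratic loss, GD with step size `lr` is stable along an eigendirection only if `lr` times the eigenvalue is below 2, which is why the threshold `2 / lr` is compared against the sharpness.

```python
import math

def sharpness(hessian, iters=100):
    """Estimate the largest eigenvalue of a symmetric matrix by power iteration."""
    n = len(hessian)
    v = [1.0] + [0.0] * (n - 1)                      # arbitrary starting vector
    for _ in range(iters):
        hv = [sum(hessian[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in hv))
        v = [c / norm for c in hv]                   # renormalize each step
    hv = [sum(hessian[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * hv[i] for i in range(n))       # Rayleigh quotient of unit v

# Hypothetical Hessian of a quadratic loss; eigenvalues are (5 +/- sqrt(5)) / 2.
H = [[3.0, 1.0], [1.0, 2.0]]
lam = sharpness(H)
lr = 0.5
print(lam, 2.0 / lr, lam < 2.0 / lr)
```

In a real network the Hessian is never formed explicitly; Hessian-vector products would replace the matrix multiplication used here.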


Tight Risk Bounds for Gradient Descent on Separable Data

Neural Information Processing Systems

Recently, there has been a marked increase in interest regarding the generalization capabilities of unregularized gradient-based learning methods.


Appendix: On the Overlooked Structure of Stochastic Gradients

Neural Information Processing Systems

Avila is a non-image dataset.

A.3 Image classification on MNIST

We perform the common per-pixel zero-mean unit-variance normalization as data preprocessing for MNIST.

Pretraining Hyperparameter Settings: We train neural networks for 50 epochs on MNIST to obtain pretrained models. The batch size is set to 1 and no weight decay is used, unless we specify otherwise. As for other optimizer hyperparameters, we apply the default settings directly.
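The per-pixel normalization mentioned above can be sketched as follows. This is a minimal illustration, assuming images are given as flat lists of pixel values; the function name and the small constant `eps`, which guards against pixels that are constant across the dataset, are assumptions and not taken from the appendix.

```python
import math

def per_pixel_normalize(images, eps=1e-8):
    """Give every pixel position zero mean and unit variance across the dataset."""
    n, d = len(images), len(images[0])
    means = [sum(img[j] for img in images) / n for j in range(d)]
    stds = [
        math.sqrt(sum((img[j] - means[j]) ** 2 for img in images) / n)
        for j in range(d)
    ]
    # eps keeps the division finite for pixels whose value never changes.
    return [
        [(img[j] - means[j]) / (stds[j] + eps) for j in range(d)]
        for img in images
    ]

# Hypothetical dataset of three "images" with three pixels each.
images = [[0.0, 10.0, 5.0], [2.0, 20.0, 5.0], [4.0, 30.0, 5.0]]
normalized = per_pixel_normalize(images)
print(normalized)
```

In practice the means and standard deviations would typically be computed on the training split and reused for the test split.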