AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning

Neural Information Processing SystemsJan-18-2025, 04:42:04 GMT

The surprising discovery of the BYOL method shows the negative samples can be replaced by adding the prediction head to the network. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we characterized the substitution effect and acceleration effect of the trainable, but identity-initialized prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head.

neural network, non-contrastive self-supervised learning, prediction head, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.44)

Add feedback

Scalable Kernel Methods via Doubly Stochastic Gradients

Neural Information Processing SystemsJan-18-2025, 03:41:46 GMT

The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called doubly stochastic functional gradients''. Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves the optimization problems by making two unbiased stochastic approximations to the functional gradient---one using random training points and another using random features associated with the kernel---and performing descent steps with this noisy functional gradient. Our algorithm is simple, need no commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting.

doubly stochastic gradient, kernel method, scalable kernel method, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Kernel Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Conflict-Averse Gradient Descent for Multi-task learning

Neural Information Processing SystemsJan-18-2025, 03:41:08 GMT

The goal of multi-task learning is to enable more efficient learning than single task learning by sharing model structures for a diverse set of tasks. A standard multi-task learning objective is to minimize the average loss across all tasks. While straightforward, using this objective often results in much worse final performance for each task than learning them independently. A major challenge in optimizing a multi-task model is the conflicting gradients, where gradients of different task objectives are not well aligned so that following the average gradient direction can be detrimental to specific tasks' performance. Previous work has proposed several heuristics to manipulate the task gradients for mitigating this problem.

average loss, conflict-averse gradient descent, multi-task, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.52)

Add feedback

Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance

Neural Information Processing SystemsJan-18-2025, 03:40:47 GMT

Recent studies have provided both empirical and theoretical evidence illustrating that heavy tails can emerge in stochastic gradient descent (SGD) in various scenarios. Such heavy tails potentially result in iterates with diverging variance, which hinders the use of conventional convergence analysis techniques that rely on the existence of the second-order moments. In this paper, we provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance, for a class of strongly convex objectives. In the case where the p -th moment of the noise exists for some p\in [1,2), we first identify a condition on the Hessian, coined p -positive (semi-)definiteness', that leads to an interesting interpolation between the positive semi-definite cone ( p 2) and the cone of diagonally dominant matrices with non-negative diagonal entries ( p 1). Under this condition, we provide a convergence rate for the distance to the global optimum in L p .

convergence rate, infinite noise variance, stochastic gradient descent, (4 more...)

Neural Information Processing Systems

Genre: Research Report (0.41)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Generalization Properties of NAS under Activation and Skip Connection Search

Neural Information Processing SystemsJan-17-2025, 23:01:55 GMT

Neural Architecture Search (NAS) has fostered the automatic discovery of state-of-the-art neural architectures. Despite the progress achieved with NAS, so far there is little attention to theoretical guarantees on NAS. In this work, we study the generalization properties of NAS under a unifying framework enabling (deep) layer skip connection search and activation function search. To this end, we derive the lower (and upper) bounds of the minimum eigenvalue of the Neural Tangent Kernel (NTK) under the (in)finite-width regime using a certain search space including mixed activation functions, fully connected, and residual neural networks. We use the minimum eigenvalue to establish generalization error bounds of NAS in the stochastic gradient descent training.

activation and skip connection search, generalization property, minimum eigenvalue, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization

Neural Information Processing SystemsJan-17-2025, 23:01:48 GMT

We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most \varepsilon (compared to the best possible with unit Euclidean norm) with an optimal, up to logarithmic factors, sample complexity of \tilde{O}(1/\varepsilon 2) and only \tilde{O}(1/\varepsilon 2) iterations. This contrasts with general stochastic convex optimization, where \Omega(1/\varepsilon 4) iterations are needed Amir et al. 2021. The lower iteration complexity is ensured by leveraging uniform convergence rather than stability. But instead of uniform convergence in a norm ball, which we show can guarantee suboptimal learning using \Theta(1/\varepsilon 4) samples, we rely on uniform convergence in a distribution-dependent ball.

generalized linear stochastic convex optimization, gradient descent, varepsilon, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.40)

Add feedback

Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks

Neural Information Processing SystemsJan-17-2025, 21:33:56 GMT

Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad \& Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.

high-dimensional two-layer neural network, phase diagram, stochastic gradient descent

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

ErrorCompensatedX: error compensation for variance reduced algorithms

Neural Information Processing SystemsJan-17-2025, 18:43:36 GMT

Communication cost is one major bottleneck for the scalability for distributed learning. One approach to reduce the communication cost is to compress the gradient during communication. However, directly compressing the gradient decelerates the convergence speed, and the resulting algorithm may diverge for biased compression. Recent work addressed this problem for stochastic gradient descent by adding back the compression error from the previous step. This idea was further extended to one class of variance reduced algorithms, where the variance of the stochastic gradient is reduced by taking a moving average over all history gradients.

algorithm, error compensation, variance, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

Add feedback

Stochastic Multiple Target Sampling Gradient Descent

Neural Information Processing SystemsJan-17-2025, 17:15:35 GMT

Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimization problem and can be viewed as a probabilistic version of this single-objective optimization problem. A natural question then arises: Can we derive a probabilistic version of the multi-objective optimization?''. To answer this question, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple unnormalized target distributions.

multiple target sampling gradient descent, single-objective optimization problem, target distribution, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)

Add feedback

The Implicit Bias of Minima Stability: A View from Function Space

Neural Information Processing SystemsJan-17-2025, 14:18:49 GMT

The loss terrains of over-parameterized neural networks have multiple global minima. However, it is well known that stochastic gradient descent (SGD) can stably converge only to minima that are sufficiently flat w.r.t. In this paper we study the effect that this mechanism has on the function implemented by the trained model. First, we extend the existing knowledge on minima stability to non-differentiable minima, which are common in ReLU nets. We then use our stability results to study a single hidden layer univariate ReLU network.

artificial intelligence, machine learning, minima stability, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.63)

Add feedback