Gradient Descent
Joint Learning of Energy-based Models and their Partition Function
Sander, Michael E., Roulet, Vincent, Liu, Tianlin, Blondel, Mathieu
Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function (normalization constant). In this paper, we propose a novel formulation for approximately learning probabilistic EBMs in combinatorially-large discrete spaces, such as sets or permutations. Our key idea is to jointly learn both an energy model and its log-partition, both parameterized as a neural network. Our approach not only provides a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points. On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions. Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtain the first tractable method for optimizing the sparsemax loss in combinatorially-large spaces. We demonstrate our approach on multilabel classification and label ranking.
Temperature-Free Loss Function for Contrastive Learning
As one of the most promising methods in self-supervised learning, contrastive learning has achieved a series of breakthroughs across numerous fields. A predominant approach to implementing contrastive learning is applying InfoNCE loss: by capturing the similarities between pairs, InfoNCE loss enables learning the representation of data. Despite its success, adopting InfoNCE loss requires tuning a temperature, which is a core hyperparameter for calibrating similarity scores. Although several studies have emphasized its significance and the sensitivity of performance to its value, searching for a valid temperature requires extensive trial-and-error experiments, which increases the difficulty of adopting InfoNCE loss. To address this difficulty, we propose a novel method to deploy InfoNCE loss without temperature. Specifically, we replace temperature scaling with the inverse hyperbolic tangent function, resulting in a modified InfoNCE loss. In addition to hyperparameter-free deployment, we observed that the proposed method even yielded a performance gain in contrastive learning. Our detailed theoretical analysis reveals that the current practice of temperature scaling in InfoNCE loss causes serious problems in gradient descent, whereas our method provides desirable gradient properties. The proposed method was validated on five contrastive learning benchmarks, yielding satisfactory results without temperature tuning.
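To make the contrast concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes paired views `z1` and `z2` whose matching rows are positives, and it applies a plain `arctanh` to the clipped cosine similarity; the abstract does not give the exact functional form (for example, any constant factor), so that choice, the clipping constant `eps`, and all function names are assumptions made for illustration.

```python
import numpy as np

def _normalize(z):
    # Scale each row to unit length so dot products are cosine similarities.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def _cross_entropy_diag(logits):
    # Row-wise log-softmax; the positive for row i is column i.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def info_nce(z1, z2, temperature=0.1):
    # Standard InfoNCE: cosine similarities divided by a tuned temperature.
    sim = _normalize(z1) @ _normalize(z2).T
    return _cross_entropy_diag(sim / temperature)

def info_nce_arctanh(z1, z2, eps=1e-6):
    # Assumed temperature-free variant: map similarities through arctanh.
    # Clipping keeps arctanh finite when a similarity reaches +1 or -1.
    sim = _normalize(z1) @ _normalize(z2).T
    sim = np.clip(sim, -1.0 + eps, 1.0 - eps)
    return _cross_entropy_diag(np.arctanh(sim))
```

Both functions take arrays of shape `(batch, dim)`. The second has no temperature argument, which is the point of the proposal; `eps` is only a numerical safeguard introduced here.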
When less is more: evolving large neural networks from small ones
Radhakrishnan, Anil, Lindner, John F., Miller, Scott T., Sinha, Sudeshna, Ditto, William L.
In contrast to conventional artificial neural networks, which are large and structurally static, we study feed-forward neural networks that are small and dynamic, whose nodes can be added (or subtracted) during training. A single neuronal weight in the network controls the network's size, while the weight itself is optimized by the same gradient-descent algorithm that optimizes the network's other weights and biases, but with a size-dependent objective or loss function. We train and evaluate such Nimble Neural Networks on nonlinear regression and classification tasks where they outperform the corresponding static networks. Growing networks to minimal, appropriate, or optimal sizes while training elucidates network dynamics and contrasts with pruning large networks after training but before deployment.
Contextually Entangled Gradient Mapping for Optimized LLM Comprehension
Sisate, Colin, Goldfinch, Alistair, Waterstone, Vincent, Kingsley, Sebastian, Blackthorn, Mariana
Contextually Entangled Gradient Mapping (CEGM) introduces a new approach to gradient optimization, redefining the relationship between contextual embeddings and gradient updates to enhance semantic coherence and reasoning capabilities in neural architectures. By treating gradients as dynamic carriers of contextual dependencies rather than isolated numerical entities, the proposed methodology bridges critical gaps in existing optimization strategies. The integration of entangled gradient dynamics into a loss regularization framework demonstrated significant improvements in tasks involving long-form reasoning, contextual retention, and adaptability to unseen domains. Experimental evaluations showed that the CEGM-enhanced model consistently outperformed baseline approaches, achieving higher accuracy in token-level predictions and greater resilience to noisy inputs. Practical implementations involved modifications to training pipelines, introducing entanglement layers and dynamic coefficient adjustments that seamlessly align with existing architectures. Results further highlighted reductions in semantic drift during sequential transformations and improvements in embedding coherence across paraphrased sentences, showing the robustness and versatility of the proposed methodology. The findings demonstrate the broader implications of gradient entanglement for both theoretical advancements and practical applications in optimization strategies.
Convergence of two-timescale gradient descent ascent dynamics: finite-dimensional and mean-field perspectives
The two-timescale gradient descent-ascent (GDA) is a canonical gradient algorithm designed to find Nash equilibria in min-max games. We analyze the two-timescale GDA by investigating the effects of learning rate ratios on convergence behavior in both finite-dimensional and mean-field settings. In particular, for finite-dimensional quadratic min-max games, we obtain long-time convergence in near quasi-static regimes through the hypocoercivity method. For mean-field GDA dynamics, we investigate convergence under a finite-scale ratio using a mixed synchronous-reflection coupling technique.
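As a small illustration of the finite-dimensional setting, the sketch below runs two-timescale GDA on a scalar quadratic min-max game. The specific game f(x, y) = 0.5·a·x² + b·x·y − 0.5·c·y², the coefficients, the step sizes, and the function name are all assumptions chosen for illustration; they are not taken from the paper.

```python
def two_timescale_gda(x0, y0, a=1.0, b=2.0, c=1.0,
                      eta_x=0.01, ratio=10.0, steps=500):
    """Simultaneous GDA on f(x, y) = 0.5*a*x**2 + b*x*y - 0.5*c*y**2.

    x minimizes f with step eta_x; y maximizes f with step ratio * eta_x.
    For a, c > 0 the unique Nash equilibrium is (0, 0).
    """
    x, y = float(x0), float(y0)
    eta_y = ratio * eta_x  # the learning-rate ratio sets the timescale separation
    for _ in range(steps):
        grad_x = a * x + b * y   # df/dx
        grad_y = b * x - c * y   # df/dy
        # Descent step for x and ascent step for y, from the same iterate.
        x, y = x - eta_x * grad_x, y + eta_y * grad_y
    return x, y
```

Calling `two_timescale_gda(1.0, -1.0)` with the defaults drives both players toward the equilibrium at the origin; varying `ratio` is a simple way to observe how the learning-rate ratio changes the convergence behavior in this toy game.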
Review for NeurIPS paper: A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions
My main concern is whether using a flattened surrogate energy in this fashion is suitable for most sampling situations. The main reason is that, by construction, our iterates are not following the true distribution particularly closely; for example, a plot of the samples obtained in the synthetic experiments (figs 2c--d) would look quite different from the original. While this does allow the algorithm to bounce out of local optima, the deviation from the true energy would make samples obtained after convergence not very useful. For point estimation situations, we might be able to get away with these samples in cases where the multiple modes of the real energy are roughly symmetric (as in the synthetic Gaussian experiments); it seems that even if we use a 'flattened' energy (which can be thought of as lower peaks with higher elevation between them), the original distribution's symmetry would be essentially preserved and the mean and other point estimates would be close enough. But flattening energies with a skewed distribution of modes might not be as accurate, as the flattened version might have a mean closer to the 'center' of the space, whereas the original mean would be closer to one of the modes near the periphery (I am visualizing a simple 2-d space).
Why are Adaptive Methods Good for Attention Models?
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise.
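The role of clipping under heavy-tailed noise can be sketched as follows. This is a generic clipped-SGD loop, not the paper's algorithm or analysis setup; the quadratic objective, the Student-t noise with 1.5 degrees of freedom (which has infinite variance), and all function names are assumptions made for illustration.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    # Rescale the gradient so its Euclidean norm is at most max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def noisy_quadratic_grad(w, rng):
    # Gradient of 0.5 * ||w||^2 plus heavy-tailed noise (assumed noise model).
    return w + rng.standard_t(df=1.5, size=w.shape)

def clipped_sgd(w0, grad_fn, lr=0.1, max_norm=1.0, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        # Each update moves w by at most lr * max_norm, however large the noise.
        w -= lr * clip_by_norm(grad_fn(w, rng), max_norm)
    return w
```

For example, `clipped_sgd(np.ones(3), noisy_quadratic_grad)` stays bounded even when an individual noise sample is very large, because clipping caps the length of every step.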
Reviews: A Universally Optimal Multistage Accelerated Stochastic Gradient Method
Originality: This paper provides a clear and deep analysis of a multi-stage accelerated SGD algorithm. The results show that the expected function value gap is bounded by an exponential decay term plus a sublinear decay term related to noise. They recover the deterministic case in the single-stage, zero-noise special case, while reaching the lower bound O(\sigma^2/n) in the noise term. The paper contains sufficient novel results and is competitive compared with related work. In particular, the main results reveal how to choose the right time to switch from constant stepsize to decaying stepsize, a crucial choice for the overall performance of stochastic algorithms.