AITopics | Gradient Descent

Gradient Descent

Joint Learning of Energy-based Models and their Partition Function

Sander, Michael E., Roulet, Vincent, Liu, Tianlin, Blondel, Mathieu

arXiv.org Machine Learning, Jan-30-2025

Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function (normalization constant). In this paper, we propose a novel formulation for approximately learning probabilistic EBMs in combinatorially large discrete spaces, such as sets or permutations. Our key idea is to jointly learn an energy model and its log-partition, each parameterized as a neural network. Our approach provides not only a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points. On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions. Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtain the first tractable method for optimizing the sparsemax loss in combinatorially large spaces. We demonstrate our approach on multilabel classification and label ranking.

artificial intelligence, bayesian inference, machine learning, (19 more...)

arXiv.org Machine Learning

2501.18528

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Temperature-Free Loss Function for Contrastive Learning

Kim, Bum Jun, Kim, Sang Woo

arXiv.org Artificial Intelligence, Jan-29-2025

As one of the most promising methods in self-supervised learning, contrastive learning has achieved a series of breakthroughs across numerous fields. A predominant approach to implementing contrastive learning is applying InfoNCE loss: by capturing the similarities between pairs, InfoNCE loss enables learning the representation of data. Despite its success, adopting InfoNCE loss requires tuning a temperature, which is a core hyperparameter for calibrating similarity scores. Although several studies have emphasized its significance and its effect on performance, searching for a valid temperature requires extensive trial-and-error experiments, which increases the difficulty of adopting InfoNCE loss. To address this difficulty, we propose a novel method to deploy InfoNCE loss without temperature. Specifically, we replace temperature scaling with the inverse hyperbolic tangent function, resulting in a modified InfoNCE loss. In addition to hyperparameter-free deployment, we observed that the proposed method even yielded a performance gain in contrastive learning. Our detailed theoretical analysis reveals that the current practice of temperature scaling in InfoNCE loss causes serious problems in gradient descent, whereas our method provides desirable gradient properties. The proposed method was validated on five contrastive learning benchmarks, yielding satisfactory results without temperature tuning.
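To make the idea in the abstract concrete, the following is a minimal sketch of an InfoNCE loss whose logits come either from temperature scaling or from the inverse hyperbolic tangent of the cosine similarity. The abstract does not state the exact form of the modified loss, so the unscaled `atanh`, the clipping constant `eps`, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def temperature_logit(tau):
    """Standard InfoNCE logit: similarity divided by a temperature tau."""
    return lambda s: s / tau

def atanh_logit(s, eps=1e-6):
    """Assumed temperature-free logit: atanh of the (clipped) similarity.

    Clipping keeps the argument strictly inside (-1, 1), where atanh is finite.
    """
    s = max(-1.0 + eps, min(1.0 - eps, s))
    return math.atanh(s)

def info_nce(anchor, positive, negatives, logit_fn):
    """Cross-entropy of picking the positive among positive + negatives."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [logit_fn(s) for s in sims]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
print(info_nce(anchor, [0.9, 0.1], negatives, atanh_logit))
print(info_nce(anchor, [0.9, 0.1], negatives, temperature_logit(0.1)))
```

In this sketch the `atanh` mapping stretches similarities near ±1 toward large magnitudes without any tunable constant, which is the role the temperature plays in the standard loss.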

artificial intelligence, infonce loss, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2501.17683

Country: Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)

Genre: Research Report > Promising Solution (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

When less is more: evolving large neural networks from small ones

Radhakrishnan, Anil, Lindner, John F., Miller, Scott T., Sinha, Sudeshna, Ditto, William L.

arXiv.org Artificial Intelligence, Jan-29-2025

In contrast to conventional artificial neural networks, which are large and structurally static, we study feed-forward neural networks that are small and dynamic, whose nodes can be added (or subtracted) during training. A single neuronal weight in the network controls the network's size, while the weight itself is optimized by the same gradient-descent algorithm that optimizes the network's other weights and biases, but with a size-dependent objective or loss function. We train and evaluate such Nimble Neural Networks on nonlinear regression and classification tasks where they outperform the corresponding static networks. Growing networks to minimal, appropriate, or optimal sizes while training elucidates network dynamics and contrasts with pruning large networks after training but before deployment.

artificial intelligence, machine learning, neural network, (17 more...)

arXiv.org Artificial Intelligence

2501.18012

Country:

North America > United States > Ohio > Wayne County > Wooster (0.04)
North America > United States > North Carolina > Wake County > Raleigh (0.04)
Asia > India > Punjab (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.37)

Contextually Entangled Gradient Mapping for Optimized LLM Comprehension

Sisate, Colin, Goldfinch, Alistair, Waterstone, Vincent, Kingsley, Sebastian, Blackthorn, Mariana

arXiv.org Artificial Intelligence, Jan-28-2025

Contextually Entangled Gradient Mapping (CEGM) introduces a new approach to gradient optimization, redefining the relationship between contextual embeddings and gradient updates to enhance semantic coherence and reasoning capabilities in neural architectures. By treating gradients as dynamic carriers of contextual dependencies rather than isolated numerical entities, the proposed methodology bridges critical gaps in existing optimization strategies. The integration of entangled gradient dynamics into a loss regularization framework demonstrated significant improvements in tasks involving long-form reasoning, contextual retention, and adaptability to unseen domains. Experimental evaluations showed that the CEGM-enhanced model consistently outperformed baseline approaches, achieving higher accuracy in token-level predictions and greater resilience to noisy inputs. Practical implementations involved modifications to training pipelines, introducing entanglement layers and dynamic coefficient adjustments that seamlessly align with existing architectures. Results further highlighted reductions in semantic drift during sequential transformations and improvements in embedding coherence across paraphrased sentences, showing the robustness and versatility of the proposed methodology. The findings demonstrate the broader implications of gradient entanglement for both theoretical advancements and practical applications in optimization strategies.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2502.00048

Country: Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Convergence of two-timescale gradient descent ascent dynamics: finite-dimensional and mean-field perspectives

An, Jing, Lu, Jianfeng

arXiv.org Artificial Intelligence, Jan-28-2025

The two-timescale gradient descent-ascent (GDA) is a canonical gradient algorithm designed to find Nash equilibria in min-max games. We analyze the two-timescale GDA by investigating the effects of learning rate ratios on convergence behavior in both finite-dimensional and mean-field settings. In particular, for finite-dimensional quadratic min-max games, we obtain long-time convergence in near quasi-static regimes through the hypocoercivity method. For mean-field GDA dynamics, we investigate convergence under a finite-scale ratio using a mixed synchronous-reflection coupling technique.
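As an illustration of what "two-timescale" means here, the sketch below runs a discrete-time gradient descent-ascent iteration on a scalar quadratic min-max game. The game f(x, y) = a x²/2 + b x y − c y²/2, the coefficient values, and the function name are assumptions chosen for the example; the paper's analysis concerns the dynamics themselves and is not reproduced by this code.

```python
def two_timescale_gda(x, y, a, b, c, eta_x, ratio, steps):
    """Simultaneous GDA on f(x, y) = a*x**2/2 + b*x*y - c*y**2/2.

    x descends with learning rate eta_x; y ascends with eta_y = ratio * eta_x,
    so `ratio` is the learning-rate ratio between the two players.
    """
    eta_y = ratio * eta_x
    for _ in range(steps):
        grad_x = a * x + b * y   # df/dx
        grad_y = b * x - c * y   # df/dy
        # Both players update from the same (old) iterate.
        x, y = x - eta_x * grad_x, y + eta_y * grad_y
    return x, y

# With a, c > 0 the unique saddle point (Nash equilibrium) is (0, 0).
print(two_timescale_gda(1.0, 1.0, a=1.0, b=2.0, c=1.0,
                        eta_x=0.01, ratio=10.0, steps=500))
```

Varying `ratio` while holding `eta_x` fixed is a simple way to observe how the learning-rate ratio changes the speed and oscillation of convergence in this toy setting.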

artificial intelligence, convergence result, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2501.17122

Country:

Asia > Middle East > Jordan (0.04)
Europe > Sweden (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Review for NeurIPS paper: A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions

Neural Information Processing Systems, Jan-27-2025, 18:59:37 GMT

My main concern is whether using a flattened surrogate energy in this fashion is suitable for most sampling situations. The main reason is that, by construction, our iterates are not following the true distribution particularly closely; for example, a plot of the samples obtained in the synthetic experiments (Figs. 2c-d) would look quite different from the original. While this does allow the algorithm to bounce out of local optima, the deviation from the true energy would make samples obtained after convergence not very useful. For point estimation, we might be able to get away with these samples in cases where the multiple modes of the real energy are roughly symmetric (as in the synthetic Gaussian experiments); it seems that even if we use a 'flattened' energy (which can be thought of as lower peaks with higher elevation between them), the original distribution's symmetry would be essentially preserved and the mean and other point estimates would be close enough. But flattening energies with a skewed distribution of modes might not be as accurate, as the flattened version might have a mean closer to the 'center' of the space, whereas the original mean would be closer to one of the modes near the periphery (I am visualizing a simple 2-d space).

multi-modal distribution, neurips paper, stochastic gradient langevin dynamic algorithm, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Review for NeurIPS paper: A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions

Neural Information Processing Systems, Jan-27-2025, 18:59:31 GMT

The paper presents valuable theoretical and empirical evidence for a novel algorithm. The AC is confident this represents valuable work but was a bit torn about the acceptance decision, as the reviewers point out several important avenues where improvement in the paper is needed.

multi-modal distribution, neurips paper, stochastic gradient langevin dynamic algorithm, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Why are Adaptive Methods Good for Attention Models?

Neural Information Processing Systems, Jan-27-2025, 16:02:50 GMT

While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise.
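The role of gradient clipping under heavy-tailed noise can be sketched as follows. This is a toy example, not the paper's algorithm or experimental setup: the objective f(w) = ‖w‖²/2, the symmetrized Pareto noise (tail index 1.5, so infinite variance), and all names are assumptions made for illustration.

```python
import math
import random

def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def heavy_tailed_noise(rng, alpha=1.5):
    """Symmetric, zero-centered noise with Pareto tails (assumed noise model)."""
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * (rng.paretovariate(alpha) - 1.0)

def clipped_sgd(w0, lr, max_norm, steps, seed=0):
    """Clipped SGD on f(w) = 0.5 * ||w||^2 with noisy gradients.

    Returns the full list of iterates so the trajectory can be inspected.
    """
    rng = random.Random(seed)
    w = list(w0)
    path = [list(w)]
    for _ in range(steps):
        # The true gradient of f is w itself; add heavy-tailed noise per coordinate.
        grad = [wi + heavy_tailed_noise(rng) for wi in w]
        grad = clip_by_norm(grad, max_norm)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        path.append(list(w))
    return path

path = clipped_sgd([5.0, -3.0], lr=0.1, max_norm=1.0, steps=200)
print(path[-1])
```

Because the clipped gradient has norm at most `max_norm`, no single step can move the iterate by more than `lr * max_norm`, however large the noise sample; this is the mechanism that limits the damage from rare, very large stochastic gradients.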

adaptive method good, attention model, noise, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)

Reviews: A Universally Optimal Multistage Accelerated Stochastic Gradient Method

Neural Information Processing Systems, Jan-27-2025, 09:16:03 GMT

Originality: This paper provides a clear and deep analysis of a multi-stage accelerated SGD algorithm. The results show that the expected function value gap is bounded by an exponential decay term plus a sublinear decay term related to noise. They recover the deterministic case in the single-stage, zero-noise special case, while reaching the lower bound O(\sigma^2/n) in the noise term. The paper contains sufficient novel results and is competitive with related work. In particular, the main results reveal how to choose the right time to switch from a constant stepsize to a decaying stepsize, a crucial choice for the overall performance of stochastic algorithms.
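The switch from a constant to a decaying stepsize mentioned in the review can be pictured with a simple schedule. The sketch below is a generic constant-then-1/t schedule with a hand-picked switch point; it is an assumption for illustration only and is not the multistage schedule derived in the paper, whose stage lengths and stepsizes are not given in this review.

```python
def step_size(t, eta0, switch_at):
    """Constant stepsize eta0 up to iteration switch_at, then 1/t decay.

    The decay branch equals eta0 at t == switch_at, so the schedule is
    continuous at the switch. switch_at must be at least 1.
    """
    if switch_at < 1:
        raise ValueError("switch_at must be >= 1")
    if t < switch_at:
        return eta0
    return eta0 * switch_at / t

# Print a few values around a hypothetical switch point of 10 iterations.
for t in (0, 5, 10, 20, 40):
    print(t, step_size(t, eta0=0.1, switch_at=10))
```

Intuitively, the constant phase drives the fast (exponential) decay of the initial error, and the decaying phase averages out gradient noise, which matches the two terms in the bound the review describes.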

algorithm, function value gap, multistage accelerated stochastic gradient method, (3 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.41)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.44)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.44)

Reviews: A Universally Optimal Multistage Accelerated Stochastic Gradient Method

Neural Information Processing Systems, Jan-27-2025, 09:15:53 GMT

This paper designs a multistage SGD algorithm that does not need to know the noise level or the optimality gap at initialization, yet obtains optimal convergence rates. This is a well-written paper with good results.

multistage accelerated stochastic gradient method

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)
