0be50b4590f1c5fdf4c8feddd63c4f67-Supplemental-Datasets_and_Benchmarks.pdf
In Figure 1 we show the common neighbor (CN) distribution among positive and negative test samples for ogbl-collab, ogbl-ppa, and ogbl-citation2. These results demonstrate that the vast majority of negative samples have no CNs. Since the CN count is typically a good heuristic, this makes it easy to identify most negative samples. We further present the CN distributions of Cora, Citeseer, Pubmed, and ogbl-ddi in Figure 3. The CN distributions of Cora, Citeseer, and Pubmed are consistent with our previous observations on the OGB datasets in Figure 1. We note that ogbl-ddi exhibits a different distribution from the other datasets: most of its negative samples do have common neighbors. This is likely because ogbl-ddi is considerably denser than the other graphs.
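The CN heuristic discussed above can be sketched in a few lines; the toy adjacency dict below is hypothetical and not drawn from any of the benchmark graphs:

```python
# Toy adjacency dict (hypothetical graph): score a candidate link by how
# many neighbors its two endpoints share (the CN heuristic).
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}

def cn_score(u, v):
    """Number of common neighbors of u and v."""
    return len(adj[u] & adj[v])

print(cn_score(0, 1))  # 0 and 1 share neighbor 2 -> 1
print(cn_score(0, 4))  # disjoint neighborhoods -> 0, like most negatives
```

In a sparse graph, a randomly sampled negative pair behaves like `(0, 4)` here: its endpoints almost never share a neighbor, which is why the heuristic separates most negatives so easily.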
Soft Adaptive Policy Optimization
Gao, Chang, Zheng, Chujie, Chen, Xiong-Hui, Dang, Kai, Liu, Shixuan, Yu, Bowen, Yang, An, Bai, Shuai, Zhou, Jingren, Lin, Junyang
Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-of-Experts models, leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, which makes it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
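The contrast between hard clipping and a smooth, temperature-controlled gate can be illustrated as follows; the exponential gate below is an illustrative assumption, since the abstract does not give SAPO's exact functional form:

```python
import math

def hard_clip_weight(ratio, eps=0.2):
    # PPO/GRPO-style hard clipping: the update weight drops to zero the
    # moment the token-level importance ratio leaves [1 - eps, 1 + eps].
    return 1.0 if (1 - eps) <= ratio <= (1 + eps) else 0.0

def soft_gate_weight(ratio, tau=0.5):
    # Illustrative smooth gate (NOT the exact SAPO formula): attenuate the
    # update continuously as a token drifts off-policy; the temperature tau
    # sets how quickly the weight decays with |log ratio|.
    return math.exp(-abs(math.log(ratio)) / tau)

for r in (1.0, 1.1, 1.5, 3.0):
    print(f"ratio={r}: hard={hard_clip_weight(r)} soft={soft_gate_weight(r):.3f}")
```

The key qualitative difference: the hard weight is a step function (full gradient inside the band, none outside), while the soft gate keeps a small but nonzero signal from moderately off-policy tokens instead of discarding it at a sharp boundary.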
Parameter Averaging in Link Prediction
Sapkota, Rupesh, Demir, Caglar, Sharma, Arnab, Ngomo, Axel-Cyrille Ngonga
Ensemble methods are widely employed to improve generalization in machine learning, which has prompted their adoption for knowledge graph embedding (KGE) models in link prediction. Typical approaches train multiple models as part of the ensemble and average their diverse predictions. However, this approach has significant drawbacks: the computational overhead of training multiple models increases latency and memory cost. In contrast, model merging offers a promising alternative that does not require training multiple models. In this work, we introduce model merging, specifically weighted averaging, for KGE models. Herein, a running average of model parameters from a given training epoch onward is maintained and used for predictions. We additionally propose an approach that selectively updates the running average of the ensemble model parameters only when generalization performance improves on a validation dataset. We evaluate these two weighted averaging approaches on link prediction tasks, comparing them against a state-of-the-art benchmark ensemble approach. Additionally, we evaluate weighted averaging with literal-augmented KGE models and on multi-hop query answering tasks. The results demonstrate that the proposed weighted averaging approach consistently improves performance across diverse evaluation settings.
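The two weighted-averaging variants described above can be sketched on toy parameter dicts; the snapshots and validation scores here are made up for illustration:

```python
# Variant (a): a plain running average of parameter snapshots.
# Variant (b): a selective running average that folds a snapshot in only
# when validation performance improves.
def fold_into_average(avg, params, n):
    """Incremental mean of n prior snapshots plus one new snapshot."""
    return {k: avg[k] + (params[k] - avg[k]) / (n + 1) for k in avg}

# Hypothetical (per-epoch snapshot, validation score) pairs.
snapshots = [({"w": 1.0}, 0.5), ({"w": 3.0}, 0.4), ({"w": 2.0}, 0.6)]

avg, n, best_val = None, 0, float("-inf")
for params, val_score in snapshots:
    if val_score > best_val:  # variant (b): gate on validation performance
        avg = dict(params) if avg is None else fold_into_average(avg, params, n)
        n += 1
        best_val = val_score

print(avg)  # the 0.4-scoring snapshot was skipped, not averaged in
```

Dropping the `if val_score > best_val` gate recovers variant (a). Either way, only one extra copy of the parameters is kept, which is the memory advantage over training a full ensemble.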
Training a Foundation Model for Materials on a Budget
Koker, Teddy, Kotak, Mit, Smidt, Tess
Foundation models for materials modeling are advancing quickly, but their training remains expensive, often placing state-of-the-art methods out of reach for many research groups. We introduce Nequix, a compact E(3)-equivariant potential that pairs a simplified NequIP design with modern training practices, including equivariant root-mean-square layer normalization and the Muon optimizer, to retain accuracy while substantially reducing compute requirements. Nequix has 700K parameters and was trained in 100 A100 GPU-hours. On the Matbench-Discovery and MDR Phonon benchmarks, Nequix ranks third overall while requiring roughly 20 times less training compute than most other methods, and it delivers inference two orders of magnitude faster than the current top-ranked model.
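The equivariant RMS layer normalization mentioned above rests on a simple fact: rescaling a vector feature by a rotation-invariant scalar preserves its direction, and hence its equivariance. A minimal sketch of that idea (not Nequix's actual implementation, which operates on general irreducible representations):

```python
import math

def equivariant_rms_norm(vectors, eps=1e-8):
    """Rescale 3D vector features by the RMS of their norms.

    The scale factor depends only on the feature norms, so it is
    rotation-invariant: directions are preserved, and after normalization
    the features have unit root-mean-square norm.
    """
    ms = sum(x * x + y * y + z * z for x, y, z in vectors) / len(vectors)
    scale = 1.0 / math.sqrt(ms + eps)
    return [[x * scale, y * scale, z * scale] for x, y, z in vectors]

out = equivariant_rms_norm([[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
print(out)
```

Because rotating the inputs leaves `ms` unchanged, rotating then normalizing gives the same result as normalizing then rotating, which is exactly the equivariance property a NequIP-style network needs its layers to respect.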
Dynamics of Learning: Generative Schedules from Latent ODEs
Sampson, Matt L., Melchior, Peter
The learning rate schedule is one of the most impactful aspects of neural network optimization, yet most schedules either follow simple parametric functions or react only to short-term training signals. None of them is informed by a comprehensive temporal view of how well neural networks actually train. We present a new learning rate scheduler that models the training performance of neural networks as a dynamical system. It leverages training runs from a hyperparameter search to learn a latent representation of the training process. Given current training metrics, it predicts the future learning rate schedule with the best long-term validation performance. Our scheduler generalizes beyond previously observed training dynamics and creates specialized schedules that deviate noticeably from common parametric functions. It achieves SOTA results for image classification with CNN and ResNet models as well as for next-token prediction with a transformer model. The trained models are located in flatter regions of the loss landscape and thus provide better generalization than those trained with other schedules. Our method is computationally efficient, optimizer-agnostic, and can easily be layered on top of ML experiment-tracking platforms. An implementation of our scheduler will be made available after acceptance.