AITopics

Neural Information Processing Systems, Aug-14-2025, 04:22:44 GMT

318f3ae8be3c97cb7555e1c932f472a1-Paper-Conference.pdf

language command, learning, win rate

Country:

Asia > China > Shanghai > Shanghai (0.04)
Europe > Czechia > Prague (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment > Games (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Neural Information Processing Systems, Aug-14-2025, 03:38:00 GMT

34f98c7c5d7063181da890ea8d25265a-Paper.pdf

approximation, assumption, function approximation

Country:

North America > United States > Illinois (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Human-Aligned Procedural Level Generation Reinforcement Learning via Text-Level-Sketch Shared Representation

Baek, In-Chang, Lee, Seoyoung, Kim, Sung-Hyun, Hwang, Geumhwan, Kim, KyungJoong

Human-aligned AI is a critical component of co-creativity, as it enables models to accurately interpret human intent and generate controllable outputs that align with design goals in collaborative content creation. This direction is especially relevant in procedural content generation via reinforcement learning (PCGRL), which is intended to serve as a tool for human designers. However, existing systems often fall short of exhibiting human-centered behavior, limiting the practical utility of AI-driven generation tools in real-world design workflows. In this paper, we propose VIPCGRL (Vision-Instruction PCGRL), a novel deep reinforcement learning framework that incorporates three modalities (text, level, and sketches) to extend the control modality and enhance human-likeness. We introduce a shared embedding space trained via quadruple contrastive learning across modalities and human-AI styles, and align the policy using an auxiliary reward based on embedding similarity. Experimental results show that VIPCGRL outperforms existing baselines in human-likeness, as validated by both quantitative metrics and human evaluations. The code and dataset will be available upon publication.

artificial intelligence, machine learning, reinforcement learning

2508.0986

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
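
The VIPCGRL abstract above mentions an auxiliary reward based on embedding similarity. A minimal sketch of that idea follows, assuming cosine similarity between a level embedding and an instruction embedding and a fixed mixing weight. The function names, the `weight` parameter, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def shaped_reward(env_reward, level_emb, instruction_emb, weight=0.1):
    """Hypothetical shaping: environment reward plus a weighted similarity bonus."""
    return env_reward + weight * cosine_similarity(level_emb, instruction_emb)

# Toy embeddings (made up for illustration).
aligned = shaped_reward(1.0, [1.0, 0.0], [1.0, 0.0], weight=0.5)
unaligned = shaped_reward(1.0, [1.0, 0.0], [0.0, 1.0], weight=0.5)
```

In this toy setup the bonus is largest when the generated level's embedding points in the same direction as the instruction's embedding, which is the intuition behind using a shared embedding space as a reward signal.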

Goal Discovery with Causal Capacity for Efficient Reinforcement Learning

Yu, Yan, Yang, Yaodong, Lu, Zhengbo, Ma, Chengdong, Zhou, Wengang, Li, Houqiang

Causal inference is crucial to how humans explore the world, and it can be modeled to enable an agent to explore the environment efficiently in reinforcement learning. Existing research indicates that establishing the causality between action and state transition helps an agent reason about how a policy affects its future trajectory, thereby promoting directed exploration. However, it is challenging to measure the causality due to its intractability in the vast state-action space of complex scenarios. In this paper, we propose a novel Goal Discovery with Causal Capacity (GDCC) framework for efficient environment exploration. Specifically, we first derive a measurement of causality in state space, i.e., causal capacity, which represents the highest influence of an agent's behavior on future trajectories. After that, we present a Monte Carlo based method to identify critical points in discrete state space and further optimize this method for continuous high-dimensional environments. Those critical points are used to uncover where the agent makes important decisions in the environment, and they are then regarded as our subgoals to guide the agent to explore more purposefully and efficiently. Empirical results from multi-objective tasks demonstrate that states with high causal capacity align with our expected subgoals, and our GDCC achieves significant success rate improvements compared to baselines.

causal capacity, machine learning, reinforcement learning

2508.09624

Genre: Research Report > New Finding (0.48)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning

Zhao, Shu, Yu, Tan, Xu, Anbang, Singh, Japinder, Shukla, Aaditya, Akkiraju, Rama

Reasoning-augmented search agents such as Search-R1, trained via reinforcement learning with verifiable rewards (RLVR), demonstrate remarkable capabilities in multi-step information retrieval from external knowledge sources. These agents address the limitations of their parametric memory by dynamically gathering relevant facts to address complex reasoning tasks. However, existing approaches suffer from a fundamental architectural limitation: they process search queries strictly sequentially, even when handling inherently parallelizable and logically independent comparisons. This sequential bottleneck significantly constrains computational efficiency, particularly for queries that require multiple entity comparisons. To address this critical limitation, we propose ParallelSearch, a novel reinforcement learning framework that empowers large language models (LLMs) to recognize parallelizable query structures and execute multiple search operations concurrently. Our approach introduces dedicated reward functions that incentivize the identification of independent query components while preserving answer accuracy through jointly considering correctness, query decomposition quality, and parallel execution benefits. Comprehensive experiments demonstrate that ParallelSearch outperforms state-of-the-art baselines by an average performance gain of 2.9% across seven question-answering benchmarks. Notably, on parallelizable questions, our method achieves a 12.7% performance improvement while requiring only 69.6% of the LLM calls compared to sequential approaches.

large language model, machine learning, parallelsearch

2508.09303

Country:

North America > United States (1.00)
Asia > Middle East > Republic of Türkiye (0.28)

Genre:

Research Report (1.00)
Personal > Obituary (0.46)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Media > Music (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
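
The ParallelSearch abstract above contrasts strictly sequential search calls with concurrent execution of independent sub-queries. The sketch below illustrates only that execution pattern, using Python's standard `concurrent.futures` thread pool. The `fake_search` function and the example sub-queries are hypothetical stand-ins; the paper's contribution (training an LLM with reward functions to produce such decompositions) is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_search(query: str) -> str:
    """Hypothetical stand-in for a call to an external search backend."""
    return f"result for: {query}"

def search_sequential(sub_queries):
    """Issue one search at a time, each waiting for the previous one."""
    return [fake_search(q) for q in sub_queries]

def search_parallel(sub_queries, max_workers=4):
    """Issue independent searches concurrently; map() keeps input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_search, sub_queries))

# A comparison question decomposed into two independent sub-queries
# (the decomposition is written by hand here for illustration).
sub_queries = ["height of Mount Everest", "height of K2"]
parallel_results = search_parallel(sub_queries)
```

Both functions return the same results in the same order; the parallel variant only helps when the sub-queries do not depend on each other's answers, which is the case the abstract calls parallelizable.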

Distilling Reinforcement Learning into Single-Batch Datasets

Wilhelm, Connor, Ventura, Dan

Dataset distillation compresses a large dataset into a small synthetic dataset such that learning on the synthetic dataset approximates learning on the original. Training on the distilled dataset can be performed in as little as one step of gradient descent. We demonstrate that distillation is generalizable to different tasks by distilling reinforcement learning environments into one-batch supervised learning datasets. This demonstrates not only distillation's ability to compress a reinforcement learning task but also its ability to transform one learning modality (reinforcement learning) into another (supervised learning). We present a novel extension of proximal policy optimization for meta-learning and use it in distillation of a multi-dimensional extension of the classic cart-pole problem, all MuJoCo environments, and several Atari games. We demonstrate distillation's ability to compress complex RL environments into one-step supervised learning, explore RL distillation's generalizability across learner architectures, and demonstrate distilling an environment into the smallest-possible synthetic dataset.

distillation, machine learning, reinforcement learning

2508.09283

Genre: Research Report > New Finding (0.68)

Industry:

Leisure & Entertainment > Games > Computer Games (0.69)
Education (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Value Function Initialization for Knowledge Transfer and Jump-start in Deep Reinforcement Learning

Mehimeh, Soumia

Value function initialization (VFI) is an effective way to achieve a jumpstart in reinforcement learning (RL) by leveraging value estimates from prior tasks. While this approach is well established in tabular settings, extending it to deep reinforcement learning (DRL) poses challenges due to the continuous nature of the state-action space, the noisy approximations of neural networks, and the impracticality of storing all past models for reuse. In this work, we address these challenges and introduce DQInit, a method that adapts value function initialization to DRL. DQInit reuses compact tabular Q-values extracted from previously solved tasks as a transferable knowledge base. It employs a knownness-based mechanism to softly integrate these transferred values into underexplored regions and gradually shift toward the agent's learned estimates, avoiding the limitations of fixed time decay. Our approach offers a novel perspective on knowledge transfer in DRL by relying solely on value estimates rather than policies or demonstrations, effectively combining the strengths of jumpstart RL and policy distillation while mitigating their drawbacks. Experiments across multiple continuous control tasks demonstrate that DQInit consistently improves early learning efficiency, stability, and overall performance compared to standard initialization and existing transfer techniques.

artificial intelligence, machine learning, reinforcement learning

2508.09277

Genre: Research Report > New Finding (0.68)

Industry: Education > Educational Setting (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
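
The DQInit abstract above describes a knownness-based mechanism that leans on transferred Q-values in underexplored regions and shifts toward the agent's own estimates as experience accumulates. One simple way to picture this is a convex blend whose weight grows with a visit count. The visit-count definition of knownness, the `threshold` parameter, and the function names below are assumptions made for illustration; the abstract does not specify the actual formula.

```python
def knownness(visit_count, threshold=10):
    """Hypothetical knownness in [0, 1]: grows linearly with visits, then saturates."""
    return min(1.0, visit_count / threshold)

def blended_q(q_learned, q_transferred, visit_count, threshold=10):
    """Blend transferred and learned Q-values according to knownness.

    With no visits the transferred value is used as-is; once the region is
    well explored the agent's own estimate takes over entirely.
    """
    k = knownness(visit_count, threshold)
    return k * q_learned + (1.0 - k) * q_transferred

# Toy values (made up): learned estimate 2.0, transferred estimate 5.0.
unexplored = blended_q(2.0, 5.0, visit_count=0)
halfway = blended_q(2.0, 5.0, visit_count=5)
well_known = blended_q(2.0, 5.0, visit_count=10)
```

Unlike a fixed time decay, the weight here depends on how often a particular region has been visited, so rarely seen states keep relying on the transferred values even late in training.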

A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models

Wang, Wenkai, Guo, Hongcan, Lv, Zheqi, Zhang, Shengyu

Self-evaluation, a model's ability to assess the correctness of its own output, is crucial for Large Multimodal Models (LMMs) to achieve self-improvement in multi-turn conversations, yet it is largely absent in foundation models. Recent work has employed reinforcement learning (RL) to enhance self-evaluation; however, its fixed reward mechanism suffers from reward hacking when optimizing multiple training objectives, leading to model collapse. In this paper, we propose AdaPO, an online reinforcement learning framework capable of adaptively adjusting the training objective in real time according to the current training state for each task. Specifically, to mitigate reward hacking, AdaPO introduces an Adaptive Reward Model (ARM) and a Reward Aware Dynamic KL Regularization mechanism. ARM assesses the task's training state from the performance distribution of model-generated multi-turn trajectories. Reward Aware Dynamic KL replaces a fixed penalty with a dynamic coefficient that is modulated by the reward gap between different multi-turn situations. Notably, our method automatically and smoothly adjusts its learning focus based on sub-tasks' training progress without manual intervention. Extensive experiments over 8 benchmarks and various models show that our method significantly enhances both direct reasoning and self-evaluation capability. We will release our code to contribute to the community.

arxiv preprint arxiv, large language model, machine learning

2508.09155

Country: Asia > China (0.28)

Genre:

Instructional Material (0.74)
Research Report (0.50)
Workflow (0.46)

Industry: Education > Curriculum > Subject-Specific Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning

Omura, Motoki, Ota, Kazuki, Osa, Takayuki, Mukuta, Yusuke, Harada, Tatsuya

For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality. The code for this study is available at https://github.com/motokiomura/annealed-q-learning.

artificial intelligence, machine learning, reinforcement learning

2506.05968

Country: Europe (0.46)

Genre:

Research Report > New Finding (1.00)
Instructional Material > Online (0.62)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
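
The abstract of "Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning" above describes annealing from an optimality-style backup to a policy-evaluation backup. The toy sketch below shows the two backups and a linear interpolation between them for a discrete set of next actions. The paper targets continuous actions, and its actual annealing mechanism may differ; the linear schedule, the discrete setting, and all names here are assumptions for illustration only.

```python
def annealed_target(reward, gamma, next_q_values, policy_probs, step, total_steps):
    """Toy TD target that moves from a max backup to an expected backup.

    step = 0           -> Bellman optimality backup (max over next actions)
    step = total_steps -> Bellman backup under the current policy
    """
    # Weight on the optimality backup, annealed linearly from 1 to 0.
    w = max(0.0, 1.0 - step / total_steps)
    optimal_backup = max(next_q_values)
    policy_backup = sum(p * q for p, q in zip(policy_probs, next_q_values))
    return reward + gamma * (w * optimal_backup + (1.0 - w) * policy_backup)

# Made-up numbers: two next actions with Q-values 1 and 3, uniform policy.
next_q = [1.0, 3.0]
probs = [0.5, 0.5]
early = annealed_target(0.0, 1.0, next_q, probs, step=0, total_steps=100)
middle = annealed_target(0.0, 1.0, next_q, probs, step=50, total_steps=100)
late = annealed_target(0.0, 1.0, next_q, probs, step=100, total_steps=100)
```

Early targets are the most optimistic, which matches the abstract's observation that modeling optimal values speeds up learning but risks overestimation; later targets fall back to the value of the current policy.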