Reinforcement Learning
AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting
Li, Renda, Huang, Hailang, Wei, Fei, Xiong, Feng, Wang, Yong, Chu, Xiangxiang
Reinforcement learning (RL) has demonstrated considerable potential for enhancing reasoning in large language models (LLMs). However, existing methods suffer from Gradient Starvation and Policy Degradation when training directly on samples with mixed difficulty. To mitigate this, prior approaches leverage Chain-of-Thought (CoT) data, but the construction of high-quality CoT annotations remains labor-intensive. Alternatively, curriculum learning strategies have been explored but frequently encounter challenges, such as difficulty mismatch, reliance on manual curriculum design, and catastrophic forgetting. To address these issues, we propose AdaCuRL, an Adaptive Curriculum Reinforcement Learning framework that integrates coarse-to-fine difficulty estimation with adaptive curriculum scheduling. This approach dynamically aligns data difficulty with model capability and incorporates a data revisitation mechanism to mitigate catastrophic forgetting. Furthermore, AdaCuRL employs adaptive reference and sparse KL strategies to prevent Policy Degradation. Extensive experiments across diverse reasoning benchmarks demonstrate that AdaCuRL consistently achieves significant performance improvements on both LLMs and MLLMs.
Latent Planning via Embedding Arithmetic: A Contrastive Approach to Strategic Reasoning
Hamara, Andrew, Hamerly, Greg, Rivas, Pablo, Freeman, Andrew C.
Planning in high-dimensional decision spaces is increasingly being studied through the lens of learned representations. Rather than training policies or value heads, we investigate whether planning can be carried out directly in an evaluation-aligned embedding space. We introduce SOLIS, which learns such a space using supervised contrastive learning. In this representation, outcome similarity is captured by proximity, and a single global advantage vector orients the space from losing to winning regions. Candidate actions are then ranked according to their alignment with this direction, reducing planning to vector operations in latent space. We demonstrate this approach in chess, where SOLIS uses only a shallow search guided by the learned embedding to reach competitive strength under constrained conditions. More broadly, our results suggest that evaluation-aligned latent planning offers a lightweight alternative to traditional dynamics models or policy learning.
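As a rough illustration of ranking candidate actions by their alignment with a global advantage direction, the sketch below uses plain Python lists as embeddings. The abstract does not say how the advantage vector is obtained; computing it as the normalized difference between the mean "winning" and mean "losing" embeddings is an assumption made here, and the function names (`advantage_direction`, `rank_candidates`) are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def advantage_direction(win_embeddings, loss_embeddings):
    """Unit vector pointing from the losing region toward the winning region.

    Assumption: the direction is the difference of class means; the source
    only states that a single global advantage vector exists.
    """
    win_mean = mean_vector(win_embeddings)
    loss_mean = mean_vector(loss_embeddings)
    diff = [w - l for w, l in zip(win_mean, loss_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("win and loss means coincide; direction undefined")
    return [d / norm for d in diff]


def rank_candidates(candidates, direction):
    """Sort candidate actions by the projection of their embedding on direction.

    candidates maps an action label to the embedding of the resulting state.
    """
    scores = {
        action: sum(e * d for e, d in zip(embedding, direction))
        for action, embedding in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D embeddings, purely illustrative.
direction = advantage_direction(
    win_embeddings=[[1.0, 0.0], [1.0, 0.2]],
    loss_embeddings=[[-1.0, 0.0], [-1.0, -0.2]],
)
ranking = rank_candidates(
    {"a": [0.5, 0.0], "b": [-0.5, 0.0], "c": [0.9, 0.1]}, direction
)
```

In a search setting, such a ranking would only order the moves explored at each node; the shallow search itself is not shown.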
CoRL-MPPI: Enhancing MPPI With Learnable Behaviours For Efficient And Provably-Safe Multi-Robot Collision Avoidance
Dergachev, Stepan, Pshenitsyn, Artem, Panov, Aleksandr, Skrynnik, Alexey, Yakovlev, Konstantin
Decentralized collision avoidance remains a core challenge for scalable multi-robot systems. One promising approach to this problem is Model Predictive Path Integral (MPPI) control, a framework that is naturally suited to handle any robot motion model and provides strong theoretical guarantees. In practice, however, an MPPI-based controller may produce suboptimal trajectories because its performance relies heavily on uninformed random sampling. In this work, we introduce CoRL-MPPI, a novel fusion of Cooperative Reinforcement Learning and MPPI to address this limitation. We train an action policy (approximated by a deep neural network) in simulation that learns local cooperative collision avoidance behaviors. This learned policy is then embedded into the MPPI framework to guide its sampling distribution, biasing it towards more intelligent and cooperative actions. Notably, CoRL-MPPI preserves all the theoretical guarantees of regular MPPI. We evaluate our approach in dense, dynamic simulation environments against state-of-the-art baselines, including ORCA, BVC, and a multi-agent MPPI implementation. Our results demonstrate that CoRL-MPPI significantly improves navigation efficiency (measured by success rate and makespan) and safety, enabling agile and robust multi-robot navigation.
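The effect of biasing the sampling distribution can be seen in a minimal one-dimensional, single-step sketch of the MPPI update: sample controls around a mean, weight each sample by the exponential of its negative cost, and return the weighted average. This is not the CoRL-MPPI algorithm. The function name `mppi_step`, the quadratic cost, and the idea of passing a policy's suggestion as `mean_action` are illustrative assumptions, and the importance-weight correction that a shifted sampling distribution requires in full MPPI is omitted.

```python
import math
import random


def mppi_step(cost_fn, mean_action, sigma=0.5, num_samples=256,
              temperature=1.0, rng=None):
    """One simplified MPPI update for a scalar control.

    Samples controls from N(mean_action, sigma^2), weights each by
    exp(-(cost - min_cost) / temperature), and returns the weighted mean.
    """
    rng = rng or random.Random(0)
    samples = [mean_action + rng.gauss(0.0, sigma) for _ in range(num_samples)]
    costs = [cost_fn(u) for u in samples]
    min_cost = min(costs)  # subtracting the minimum keeps exp() well scaled
    weights = [math.exp(-(c - min_cost) / temperature) for c in costs]
    total = sum(weights)
    return sum(w * u for w, u in zip(weights, samples)) / total


def cost(u):
    """Hypothetical cost whose minimum is at u = 2."""
    return (u - 2.0) ** 2


# Uninformed sampling centered at zero versus sampling centered on a
# (hypothetical) policy suggestion that is already near the optimum.
uninformed = mppi_step(cost, mean_action=0.0)
informed = mppi_step(cost, mean_action=1.8)
```

With the same sample budget, the informed run lands closer to the cost minimum because more of its samples fall in the low-cost region.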
MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment
Amiri, Mohsen, Avrachenkov, Konstantin, Mimouni, Ibtihal El, Magnússon, Sindri
Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, despite unobserved regime switches, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices. We validate MARBLE on a calibrated simulator-embedded (digital twin) recommender system, where QWI consistently adapts to a shifting latent state and converges to an optimal policy, empirically corroborating our theoretical findings.
Learning Efficient Communication Protocols for Multi-Agent Reinforcement Learning
Zhang, Xinren, Yu, Jiadong, Zhong, Zixin
Multi-Agent Systems (MAS) have emerged as a powerful paradigm for modeling complex interactions among autonomous entities in distributed environments. In Multi-Agent Reinforcement Learning (MARL), communication enables coordination but can lead to inefficient information exchange, since agents may generate redundant or non-essential messages. While prior work has focused on boosting task performance with information exchange, existing research lacks a thorough investigation of both the appropriate definition and the optimization of communication protocols (communication topology and message). To fill this gap, we introduce a generalized framework for learning multi-round communication protocols that are both effective and efficient. Within this framework, we propose three novel Communication Efficiency Metrics (CEMs) to guide and evaluate the learning process: the Information Entropy Efficiency Index (IEI) and Specialization Efficiency Index (SEI) for efficiency-augmented optimization, and the Topology Efficiency Index (TEI) for explicit evaluation. We integrate IEI and SEI into adjusted loss functions to promote informative messaging and role specialization, while using TEI to quantify the trade-off between communication volume and task performance. Through comprehensive experiments, we demonstrate that our learned communication protocol significantly enhances communication efficiency and achieves better cooperation performance with improved success rates.
Diffusion Policies with Value-Conditional Optimization for Offline Reinforcement Learning
Ma, Yunchang, Liu, Tenglong, Lan, Yixing, Yin, Xin, Zhang, Changxin, Zhang, Xinglong, Xu, Xin
In offline reinforcement learning, value overestimation caused by out-of-distribution (OOD) actions significantly limits policy performance. Recently, diffusion models have been leveraged for their strong distribution-matching capabilities, enforcing conservatism through behavior policy constraints. However, existing methods often apply indiscriminate regularization to redundant actions in low-quality datasets, resulting in excessive conservatism and an imbalance between the expressiveness and efficiency of diffusion modeling. To address these issues, we propose DIffusion policies with Value-conditional Optimization (DIVO), a novel approach that leverages diffusion models to generate high-quality, broadly covered in-distribution state-action samples while facilitating efficient policy improvement. Specifically, DIVO introduces a binary-weighted mechanism that utilizes the advantage values of actions in the offline dataset to guide diffusion model training. This enables a more precise alignment with the dataset's distribution while selectively expanding the boundaries of high-advantage actions. During policy improvement, DIVO dynamically filters high-return-potential actions from the diffusion model, effectively guiding the learned policy toward better performance. This approach achieves a critical balance between conservatism and explorability in offline RL. We evaluate DIVO on the D4RL benchmark and compare it against state-of-the-art baselines. Empirical results demonstrate that DIVO achieves superior performance, delivering significant improvements in average returns across locomotion tasks and outperforming existing methods in the challenging AntMaze domain, where sparse rewards pose a major difficulty.
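One way to picture an advantage-based binary weighting of a training loss is the sketch below. The abstract does not give DIVO's actual weighting rule, so the threshold, the two weight values, and the normalization are all assumptions, and the names `binary_weights` and `weighted_loss` are hypothetical. The per-sample losses stand in for whatever diffusion training loss is used.

```python
def binary_weights(advantages, threshold=0.0, high=1.0, low=0.0):
    """Assign one of two weights per sample based on its advantage.

    Assumption: samples with advantage above `threshold` get `high`,
    all others get `low`; the source only says the weighting is binary.
    """
    return [high if a > threshold else low for a in advantages]


def weighted_loss(per_sample_losses, weights):
    """Weight-normalized average of per-sample losses."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # no sample selected; nothing to learn from in this batch
    weighted_sum = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted_sum / total_weight


# Toy batch: the second action has negative advantage and is masked out.
weights = binary_weights([0.5, -0.2, 1.0])
loss = weighted_loss([2.0, 10.0, 4.0], weights)
```

Setting `low` to a small positive value instead of zero would keep low-advantage actions in the objective with reduced influence, which is one possible way to trade conservatism against selectivity.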
A Shared Control Framework for Mobile Robots with Planning-Level Intention Prediction
Zhang, Jinyu, Han, Lijun, Jian, Feng, Zhang, Lingxi, Wang, Hesheng
In mobile robot shared control, effectively understanding human motion intention is critical for seamless human-robot collaboration. This paper presents a novel shared control framework featuring planning-level intention prediction. A path replanning algorithm is designed to adjust the robot's desired trajectory according to inferred human intentions. To represent future motion intentions, we introduce the concept of an intention domain, which serves as a constraint for path replanning. The intention-domain prediction and path replanning problems are jointly formulated as a Markov Decision Process and solved through deep reinforcement learning. In addition, a Voronoi-based human trajectory generation algorithm is developed, allowing the model to be trained entirely in simulation without human participation or demonstration data. Extensive simulations and real-world user studies demonstrate that the proposed method significantly reduces operator workload and enhances safety, without compromising task efficiency compared with existing assistive teleoperation approaches. Mobile robots have advanced significantly in locomotion, perception, and navigation. However, they still struggle to handle demanding real-world tasks such as search and rescue. Their limitations in perception and cognitive awareness prevent them from adapting to complex and unpredictable environments. A promising direction to overcome these challenges is the integration of a human operator into the system, which is often referred to as a shared control framework. As a result, system performance can be substantially improved. In many tasks, mobile robots are expected to reach a target location or follow a predefined path.
DRL-Based Beam Positioning for LEO Satellite Constellations with Weighted Least Squares
Chou, Po-Heng, Wang, Chiapin, Chen, Kuan-Hao, Hsiao, Wei-Chen
In this paper, we propose a reinforcement learning based beam weighting framework that couples a policy network with an augmented weighted least squares (WLS) estimator for accurate and low-complexity positioning in multi-beam LEO constellations. Unlike conventional geometry or CSI-dependent approaches, the policy learns directly from uplink pilot responses and geometry features, enabling robust localization without explicit CSI estimation. Across representative scenarios, the proposed method reduces the mean positioning error by 99.3% compared with the geometry-based baseline, achieving 0.395 m RMSE with near real-time inference. The integration of terrestrial, aerial, and satellite segments into a unified ground-air-space architecture has emerged as a key enabler for future sixth-generation (6G) networks, promising seamless connectivity, low latency, and global coverage [1]. Among these, low Earth orbit (LEO) satellite constellations are particularly attractive due to their wide coverage, rapid revisit capability, and suitability for delay-sensitive services.
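For readers unfamiliar with weighted least squares, the following sketch solves the generic WLS normal equations for a two-dimensional unknown. It is not the paper's augmented estimator. The linear measurement model, the function name `weighted_least_squares_2d`, and the idea that the per-measurement weights would be supplied by a learned policy are assumptions made for illustration.

```python
def weighted_least_squares_2d(H, y, w):
    """Solve min_x sum_i w_i * (y_i - H_i . x)^2 for a 2-D unknown x.

    H is a list of (h1, h2) measurement rows, y the measurements, and
    w the non-negative per-measurement weights. The solution satisfies
    the normal equations (H^T W H) x = H^T W y.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (h1, h2), yi, wi in zip(H, y, w):
        a11 += wi * h1 * h1
        a12 += wi * h1 * h2
        a22 += wi * h2 * h2
        b1 += wi * h1 * yi
        b2 += wi * h2 * yi
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("weighted design matrix is singular")
    # Closed-form inverse of the symmetric 2x2 matrix.
    x1 = (a22 * b1 - a12 * b2) / det
    x2 = (a11 * b2 - a12 * b1) / det
    return x1, x2


# Four linear measurements of the point (3, -2); the third is corrupted
# and receives zero weight, so it does not affect the estimate.
H = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
y = [3.0, -2.0, 10.0, 5.0]
estimate = weighted_least_squares_2d(H, y, w=[1.0, 1.0, 0.0, 1.0])
```

The example shows why the choice of weights matters: with equal weights the corrupted measurement would pull the estimate away from the true point.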
TIGER-MARL: Enhancing Multi-Agent Reinforcement Learning with Temporal Information through Graph-based Embeddings and Representations
Gupta, Nikunj, Twardecka, Ludwika, Hare, James Zachary, Milzman, Jesse, Kannan, Rajgopal, Prasanna, Viktor
In this paper, we propose capturing and utilizing Temporal Information through Graph-based Embeddings and Representations (TIGER) to enhance multi-agent reinforcement learning (MARL). We explicitly model how inter-agent coordination structures evolve over time. Most MARL approaches rely on static or per-step relational graphs and thus overlook the temporal evolution of interactions that naturally arises as agents adapt, move, or reorganize cooperation strategies. Capturing such evolving dependencies is key to achieving robust and adaptive coordination. To this end, TIGER constructs dynamic temporal graphs of MARL agents, connecting their current and historical interactions. It then employs a temporal attention-based encoder to aggregate information across these structural and temporal neighborhoods, yielding time-aware agent embeddings that guide cooperative policy learning. Through extensive experiments on two coordination-intensive benchmarks, we show that TIGER consistently outperforms diverse value-decomposition and graph-based MARL baselines in task performance and sample efficiency. Furthermore, we conduct comprehensive ablation studies to isolate the impact of key design parameters in TIGER, revealing how structural and temporal factors can jointly shape effective policy learning in MARL. All code is available at https://github.com/Nikunj-Gupta/tiger-marl.
Interpretable by Design: Query-Specific Neural Modules for Explainable Reinforcement Learning
Reinforcement learning has traditionally focused on a singular objective: learning policies that select actions to maximize reward. We challenge this paradigm by asking: what if we explicitly architected RL systems as inference engines that can answer diverse queries about their environment? In deterministic settings, trained agents implicitly encode rich knowledge about reachability, distances, values, and dynamics, yet current architectures are not designed to expose this information efficiently. We introduce Query Conditioned Deterministic Inference Networks (QDIN), a unified architecture that treats different types of queries (policy, reachability, paths, comparisons) as first-class citizens, with specialized neural modules optimized for each inference pattern. Our key empirical finding reveals a fundamental decoupling: inference accuracy can reach near-perfect levels (99% reachability IoU) even when control performance remains suboptimal (31% return), suggesting that the representations needed for accurate world knowledge differ from those required for optimal control. Experiments demonstrate that query-specialized architectures outperform both unified models and post-hoc extraction methods, while maintaining competitive control performance. This work establishes a research agenda for RL systems designed from inception as queryable knowledge bases, with implications for interpretability, verification, and human-AI collaboration.