AITopics

Dynamic modeling and control are critical for unleashing soft robots' potential, yet remain challenging due to their complex constitutive behaviors and real-world operating conditions. Bio-inspired musculoskeletal robots, which integrate rigid skeletons with soft actuators, combine high load-bearing capacity with inherent flexibility. Although actuation dynamics have been studied through experimental methods and surrogate models, accurate and effective modeling and simulation remain a significant challenge, especially for large-scale hybrid rigid-soft robots with continuously distributed mass, kinematic loops, and diverse motion modes. To address these challenges, we propose EquiMus, an energy-equivalent dynamic modeling framework and MuJoCo-based simulation for musculoskeletal rigid-soft hybrid robots with linear elastic actuators. The equivalence and effectiveness of the proposed approach are validated and examined through both simulations and real-world experiments on a bionic robotic leg. EquiMus further demonstrates its utility for downstream tasks, including controller design and learning-based control strategies.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

doi: 10.1109/LRA.2025.3621980

2511.07887

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)

Algorithm-Relative Trajectory Valuation in Policy Gradient Control

Li, Shihao, Li, Jiachen, Xu, Jiamin, Martin, Christopher, Li, Wei, Chen, Dongmei

We study how trajectory value depends on the learning algorithm in policy-gradient control. Using Trajectory Shapley in an uncertain LQR, we find a robust negative correlation between a trajectory's information content (Persistence of Excitation, PE) and its marginal value under vanilla REINFORCE (e.g., $r \approx -0.38$). We prove a variance-mediated mechanism: (i) for fixed energy, higher PE yields lower gradient variance; (ii) near saddle regions, higher variance increases the probability of escaping poor basins and thus raises marginal contribution. When the update is stabilized (state whitening or Fisher preconditioning), this variance channel is neutralized and information content dominates, flipping the correlation positive (e.g., $r \approx +0.29$). Hence, trajectory value is algorithm-relative: it emerges from the interaction between data statistics and update dynamics. Experiments on LQR validate the two-step mechanism and the flip, and show that decision-aligned scores (Leave-One-Out) complement Shapley for pruning near the full set, while Shapley remains effective for identifying high-impact (and toxic) subsets.
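As a rough illustration of how Shapley and Leave-One-Out scores can disagree, the sketch below computes both exactly for a tiny set of items. The utility function `toy_utility` is a hypothetical stand-in (it is not the LQR training objective from the paper): items 0 and 1 are redundant with each other, while item 2 is unique.

```python
from itertools import permutations

def shapley_values(items, utility):
    """Exact Shapley values: average marginal gain over all orderings."""
    values = {i: 0.0 for i in items}
    orderings = list(permutations(items))
    for order in orderings:
        coalition = set()
        prev = utility(frozenset(coalition))
        for i in order:
            coalition.add(i)
            cur = utility(frozenset(coalition))
            values[i] += cur - prev
            prev = cur
    return {i: v / len(orderings) for i, v in values.items()}

def leave_one_out(items, utility):
    """Leave-One-Out score: drop in utility when one item is removed from the full set."""
    full_set = frozenset(items)
    full = utility(full_set)
    return {i: full - utility(full_set - {i}) for i in items}

def toy_utility(subset):
    """Hypothetical utility: items 0 and 1 are redundant, item 2 is unique."""
    score = 0.0
    if 0 in subset or 1 in subset:
        score += 1.0
    if 2 in subset:
        score += 1.0
    return score

items = [0, 1, 2]
shap = shapley_values(items, toy_utility)
loo = leave_one_out(items, toy_utility)
print("Shapley:", shap)
print("Leave-One-Out:", loo)
```

In this toy case Leave-One-Out assigns zero to each redundant item, because removing either one alone changes nothing near the full set, while Shapley splits the shared credit between them.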

machine learning, reinforcement learning, variance, (19 more...)

2511.07878

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Diffusion Guided Adversarial State Perturbations in Reinforcement Learning

Sun, Xiaolin, Liu, Feidi, Ding, Zhengming, Zheng, ZiZhan

Reinforcement learning (RL) systems, while achieving remarkable success across various domains, are vulnerable to adversarial attacks. This is especially a concern in vision-based environments where minor manipulations of high-dimensional image inputs can easily mislead the agent's behavior. To this end, various defenses have been proposed recently, with state-of-the-art approaches achieving robust performance even under large state perturbations. However, after closer investigation, we found that the effectiveness of the current defenses is due to a fundamental weakness of the existing $l_p$ norm-constrained attacks, which can barely alter the semantics of image input even under a relatively large perturbation budget. In this work, we propose SHIFT, a novel policy-agnostic diffusion-based state perturbation attack to go beyond this limitation. Our attack is able to generate perturbed states that are semantically different from the true states while remaining realistic and history-aligned to avoid detection. Evaluations show that our attack effectively breaks existing defenses, including the most sophisticated ones, significantly outperforming existing attacks while being more perceptually stealthy. The results highlight the vulnerability of RL agents to semantics-aware adversarial perturbations, indicating the importance of developing more robust policies.

diffusion model, machine learning, reinforcement learning, (19 more...)

2511.07701

Country: Asia (0.27)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Partial Action Replacement: Tackling Distribution Shift in Offline MARL

Jin, Yue, Montana, Giovanni

Offline multi-agent reinforcement learning (MARL) is severely hampered by the challenge of evaluating out-of-distribution (OOD) joint actions. Our core finding is that when the behavior policy is factorized (a common scenario in which agents act fully or partially independently during data collection), a strategy of partial action replacement (PAR) can significantly mitigate this challenge. PAR updates the actions of a single agent or a subset of agents while the others remain fixed to the behavioral data, reducing distribution shift compared to full joint-action updates. Based on this insight, we develop Soft-Partial Conservative Q-Learning (SPaCQL), which uses PAR to mitigate the OOD issue and dynamically weights different PAR strategies based on the uncertainty of value estimation. We provide a rigorous theoretical foundation for this approach, proving that under factorized behavior policies, the induced distribution shift scales linearly with the number of deviating agents rather than exponentially with the joint-action space. This yields a provably tighter value error bound for this important class of offline MARL problems. Our theoretical results also indicate that SPaCQL adaptively addresses distribution shift using uncertainty-informed weights. Our empirical results demonstrate that SPaCQL enables more effective policy learning and clearly outperforms baseline algorithms when the offline dataset exhibits the independence structure.
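The core idea of partial action replacement can be sketched in a few lines: start from a joint action in the dataset and swap in the current policy's action only for a chosen subset of agents. The helper names below are hypothetical and the sketch covers only the construction of the replaced joint actions, not SPaCQL's Q-learning update or its uncertainty-based weighting.

```python
from typing import List, Sequence

def partial_action_replacement(
    behavior_actions: Sequence[int],
    policy_actions: Sequence[int],
    replace_idx: Sequence[int],
) -> List[int]:
    """Keep dataset actions for all agents except those listed in replace_idx."""
    replace = set(replace_idx)
    return [
        policy_actions[i] if i in replace else behavior_actions[i]
        for i in range(len(behavior_actions))
    ]

def single_agent_replacements(
    behavior_actions: Sequence[int],
    policy_actions: Sequence[int],
) -> List[List[int]]:
    """All joint actions that deviate from the dataset in exactly one agent."""
    return [
        partial_action_replacement(behavior_actions, policy_actions, [i])
        for i in range(len(behavior_actions))
    ]

# Hypothetical three-agent example with discrete actions.
dataset_joint_action = [0, 0, 0]
policy_joint_action = [1, 2, 3]
print(partial_action_replacement(dataset_joint_action, policy_joint_action, [1]))
print(single_agent_replacements(dataset_joint_action, policy_joint_action))
```

Each joint action produced this way differs from the logged data in only the replaced agents, which is the sense in which the abstract describes the shift as growing with the number of deviating agents.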

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2511.07629

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Towards Adaptive Humanoid Control via Multi-Behavior Distillation and Reinforced Fine-Tuning

Zhao, Yingnan, Wang, Xinmiao, Wang, Dewei, Liu, Xinzhe, Lu, Dan, Han, Qilong, Liu, Peng, Bai, Chenjia

Humanoid robots show promise for learning a diverse set of human-like locomotion behaviors, including standing up, walking, running, and jumping. However, existing methods predominantly require training independent policies for each skill, yielding behavior-specific controllers that exhibit limited generalization and brittle performance when deployed on irregular terrains and in diverse situations. To address this challenge, we propose Adaptive Humanoid Control (AHC), which adopts a two-stage framework to learn an adaptive humanoid locomotion controller across different skills and terrains. Specifically, we first train several primary locomotion policies and perform a multi-behavior distillation process to obtain a basic multi-behavior controller, facilitating adaptive behavior switching based on the environment. Then, we perform reinforced fine-tuning by collecting online feedback while performing adaptive behaviors on more diverse terrains, enhancing the controller's terrain adaptability. We conduct experiments both in simulation and in the real world on Unitree G1 robots. The results show that our method exhibits strong adaptability across various situations and terrains.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2511.06371

Country: Asia > China (0.46)

Genre: Research Report > New Finding (0.66)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)

Policy Transfer for Continuous-Time Reinforcement Learning: A (Rough) Differential Equation Approach

Guo, Xin, Lyu, Zijiu

This paper studies policy transfer, one of the well-known transfer learning techniques adopted in large language models, for two classes of continuous-time reinforcement learning problems. In the first class of continuous-time linear-quadratic systems with Shannon's entropy regularization (a.k.a. LQRs), we fully exploit the Gaussian structure of their optimal policy and the stability of their associated Riccati equations. In the second class where the system has possibly non-linear and bounded dynamics, the key technical component is the stability of diffusion SDEs which is established by invoking the rough path theory. Our work provides the first theoretical proof of policy transfer for continuous-time RL: an optimal policy learned for one RL problem can be used to initialize the search for a near-optimal policy in a closely related RL problem, while maintaining the convergence rate of the original algorithm. To illustrate the benefit of policy transfer for RL, we propose a novel policy learning algorithm for continuous-time LQRs, which achieves global linear convergence and local super-linear convergence. As a byproduct of our analysis, we derive the stability of a concrete class of continuous-time score-based diffusion models via their connection with LQRs.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2510.15165

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Industry: Education (0.87)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.61)

Optimism as Risk-Seeking in Multi-Agent Reinforcement Learning

Zhang, Runyu, Li, Na, Ozdaglar, Asuman, Shamma, Jeff, Zardini, Gioele

Risk sensitivity has become a central theme in reinforcement learning (RL), where convex risk measures and robust formulations provide principled ways to model preferences beyond expected return. Recent extensions to multi-agent RL (MARL) have largely emphasized the risk-averse setting, prioritizing robustness to uncertainty. In cooperative MARL, however, such conservatism often leads to suboptimal equilibria, and a parallel line of work has shown that optimism can promote cooperation. Existing optimistic methods, though effective in practice, are typically heuristic and lack theoretical grounding. Building on the dual representation for convex risk measures, we propose a principled framework that interprets risk-seeking objectives as optimism. We introduce optimistic value functions, which formalize optimism as divergence-penalized risk-seeking evaluations. Building on this foundation, we derive a policy-gradient theorem for optimistic value functions, including explicit formulas for the entropic risk/KL-penalty setting, and develop decentralized optimistic actor-critic algorithms that implement these updates. Empirical results on cooperative benchmarks demonstrate that risk-seeking optimism consistently improves coordination over both risk-neutral baselines and heuristic optimistic methods. Our framework thus unifies risk-sensitive learning and optimism, offering a theoretically grounded and practically effective approach to cooperation in MARL.
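For the entropic case mentioned above, a standard form of the risk measure is $\frac{1}{\beta}\log \mathbb{E}[e^{\beta R}]$, which for $\beta > 0$ equals the KL-penalized optimistic evaluation $\sup_Q \{\mathbb{E}_Q[R] - \frac{1}{\beta}\mathrm{KL}(Q \| P)\}$. The sketch below evaluates this quantity on sampled returns; the function name and the sample values are illustrative assumptions, and the paper's exact definitions may differ.

```python
import numpy as np

def entropic_value(returns, beta):
    """(1/beta) * log E[exp(beta * R)], computed stably via log-sum-exp.

    beta > 0 is risk-seeking (optimistic), beta < 0 is risk-averse,
    and beta == 0 falls back to the plain mean.
    """
    r = np.asarray(returns, dtype=float)
    if beta == 0.0:
        return float(r.mean())
    z = beta * r
    m = z.max()
    log_mean_exp = m + np.log(np.mean(np.exp(z - m)))
    return float(log_mean_exp / beta)

# Hypothetical sampled returns for one state.
sample_returns = [0.0, 1.0, 4.0]
neutral = entropic_value(sample_returns, 0.0)
optimistic = entropic_value(sample_returns, 1.0)
pessimistic = entropic_value(sample_returns, -1.0)
print(pessimistic, neutral, optimistic)
```

By Jensen's inequality the optimistic value is never below the mean and the pessimistic value is never above it, so the sign of `beta` acts as a single knob between risk-averse and risk-seeking evaluation.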

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2509.24047

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

ORVIT: Near-Optimal Online Distributionally Robust Reinforcement Learning

Ghosh, Debamita, Atia, George K., Wang, Yue

We investigate reinforcement learning (RL) in the presence of distributional mismatch between training and deployment, where policies trained in simulators often underperform in practice and reliable guarantees on real-world performance are therefore essential. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments and providing an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment, assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study a more practical and challenging setting: online distributionally robust RL, where the agent interacts only with a single unknown training environment while seeking policies that are robust with respect to an uncertainty set around this nominal model. We consider general $f$-divergence-based ambiguity sets, including $\chi^2$ and KL divergence balls, and design a computationally efficient algorithm that achieves sublinear regret for the robust control objective under minimal assumptions, without requiring generative or offline data access. Moreover, we establish a corresponding minimax lower bound on the regret of any online algorithm, demonstrating the near-optimality of our method. Experiments across diverse environments with model misspecification show that our approach consistently improves worst-case performance and aligns with the theoretical guarantees.
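To make the KL-ball ambiguity set concrete, the sketch below evaluates the worst-case expected value over all distributions within KL radius `rho` of a nominal distribution, using the standard dual form $\sup_{\lambda > 0}\{-\lambda \log \mathbb{E}_P[e^{-V/\lambda}] - \lambda\rho\}$. It is a generic illustration under stated assumptions (a finite next-state distribution, a coarse grid search over $\lambda$, hypothetical names and numbers), not the algorithm proposed in the paper.

```python
import numpy as np

def kl_robust_expectation(values, probs, rho, lambdas=None):
    """Approximate inf over {Q : KL(Q || P) <= rho} of E_Q[V] via the dual.

    The supremum over lambda is approximated on a log-spaced grid,
    so the result is a slightly conservative (lower) estimate.
    """
    v = np.asarray(values, dtype=float)
    p = np.asarray(probs, dtype=float)
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 2000)
    best = -np.inf
    for lam in lambdas:
        z = -v / lam
        m = z.max()
        # Stable log E_P[exp(-V / lam)] via log-sum-exp.
        log_mgf = m + np.log(np.sum(p * np.exp(z - m)))
        best = max(best, -lam * log_mgf - lam * rho)
    return float(best)

# Hypothetical next-state values and nominal transition probabilities.
next_values = [0.0, 1.0, 4.0]
nominal = [0.2, 0.5, 0.3]
nominal_mean = float(np.dot(nominal, next_values))
robust_small = kl_robust_expectation(next_values, nominal, rho=0.1)
robust_large = kl_robust_expectation(next_values, nominal, rho=1.0)
print(nominal_mean, robust_small, robust_large)
```

Growing the radius `rho` can only lower the worst-case value, which is the trade-off a distributionally robust agent makes between nominal performance and protection against model mismatch.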

data mining, machine learning, reinforcement learning, (16 more...)

2508.03768

Country: North America > United States (0.45)

Genre: Research Report (0.64)

Industry:

Education (1.00)
Leisure & Entertainment > Games (0.45)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes

Jain, Shalin Anand, Liu, Jiazhen, Kailas, Siva, Ravichandar, Harish

Multi-agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi-robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi-robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluation of their individual contributions. The Multi-Agent RL Benchmark and Learning Environment for the Robotarium (MARBLER) is an exciting recent step in providing a standardized robotics-relevant platform for MARL, by bridging the Robotarium testbed with existing MARL software infrastructure. However, MARBLER lacks support for parallelization and GPU/TPU execution, making the platform prohibitively slow compared to modern MARL environments and hindering adoption. We contribute JaxRobotarium, a Jax-powered end-to-end simulation, learning, deployment, and benchmarking platform for the Robotarium. JaxRobotarium enables rapid training and deployment of multi-robot RL (MRRL) policies with realistic robot dynamics and safety constraints, supporting parallelization and hardware acceleration. Our generalizable learning interface integrates easily with SOTA MARL libraries (e.g., JaxMARL). In addition, JaxRobotarium includes eight standardized coordination scenarios, including four novel scenarios that bring established MARL benchmark tasks (e.g., RWARE and Level-Based Foraging) to a robotics setting. We demonstrate that JaxRobotarium retains high simulation fidelity while achieving dramatic speedups over baseline (20x in training and 150x in simulation), and provides an open-access sim-to-real evaluation pipeline through the Robotarium testbed, accelerating and democratizing access to multi-robot learning research and evaluation. Our code is available at https://github.com/GT-STAR-Lab/JaxRobotarium.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2505.06771

Genre: Research Report (0.70)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.89)

Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

Lee, Joongkyu, Yi, Seouh-won, Oh, Min-hwan

arXiv.org Machine Learning, Nov-12-2025

We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve (and can even deteriorate) as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{O}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega\left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
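Two pieces of the abstract lend themselves to a small numeric sketch: the Plackett-Luce probability of a full ranking, and how the leading term $\frac{d}{T}\sqrt{\sum_t 1/|S_t|}$ shrinks as subsets grow. The function names and scores below are hypothetical, constants and logarithmic factors hidden by $\tilde{O}$ are ignored, and nothing here implements M-AUPO itself.

```python
import math
from itertools import permutations

def pl_ranking_prob(scores, ranking):
    """Plackett-Luce probability of observing `ranking` (best first)."""
    remaining = list(ranking)
    prob = 1.0
    for item in ranking:
        denom = sum(math.exp(scores[j]) for j in remaining)
        prob *= math.exp(scores[item]) / denom
        remaining.remove(item)
    return prob

def gap_bound_rate(d, subset_sizes):
    """Leading term (d / T) * sqrt(sum_t 1 / |S_t|), without constants or log factors."""
    T = len(subset_sizes)
    return (d / T) * math.sqrt(sum(1.0 / s for s in subset_sizes))

# Hypothetical utility scores for three actions.
scores = {0: 1.0, 1: 0.0, 2: -1.0}
total = sum(pl_ranking_prob(scores, r) for r in permutations(scores))
print("sum over all rankings:", total)

# Constant subset size K over T rounds gives d / sqrt(T * K).
rate_k2 = gap_bound_rate(d=5, subset_sizes=[2] * 100)
rate_k8 = gap_bound_rate(d=5, subset_sizes=[8] * 100)
print(rate_k2, rate_k8)
```

With a constant subset size $K$ the term reduces to $d/\sqrt{TK}$, so quadrupling the subset size halves it, consistent with the abstract's claim that larger subsets improve the guarantee.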

large language model, machine learning, reinforcement learning, (18 more...)

2510.18713

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.84)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)