Duan, Xiaoming
Analysis of On-policy Policy Gradient Methods under the Distribution Mismatch
Wang, Weizhen, He, Jianping, Duan, Xiaoming
Policy gradient methods are among the most successful methods for solving challenging reinforcement learning problems. However, despite their empirical successes, many state-of-the-art policy gradient algorithms for discounted problems deviate from the theoretical policy gradient theorem due to a distribution mismatch. In this work, we analyze the impact of this mismatch on policy gradient methods. Specifically, we first show that for tabular parameterizations, the methods remain globally optimal under the mismatch. Then, we extend this analysis to more general parameterizations by leveraging the theory of biased stochastic gradient descent. Our findings offer new insights into the robustness of policy gradient methods as well as the gap between theoretical foundations and practical implementations.
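For reference, the mismatch in question is usually described as follows: the policy gradient theorem for the discounted objective weights states by the discounted visitation distribution, while practical implementations sample states from the undiscounted one. This is the standard textbook formulation, not necessarily the paper's exact notation:

```latex
\nabla_\theta J(\theta) \;\propto\; \sum_{s} d_\gamma^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a),
\qquad
d_\gamma^{\pi_\theta}(s) \;=\; (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid \pi_\theta),
```

where practical implementations replace the discounted distribution $d_\gamma^{\pi_\theta}$ with the undiscounted visitation distribution $d^{\pi_\theta}$.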
Inverse Reinforcement Learning with Unknown Reward Model based on Structural Risk Minimization
Qu, Chendi, He, Jianping, Duan, Xiaoming, Chen, Jiming
Inverse reinforcement learning (IRL) usually assumes that the model of the reward function is pre-specified and estimates only its parameters. However, determining a proper reward model is nontrivial: a simplistic model is less likely to contain the real reward function, while a highly complex model incurs substantial computation cost and risks overfitting. This paper addresses this trade-off in IRL model selection by introducing the structural risk minimization (SRM) method from statistical learning. SRM selects an optimal reward function class from a hypothesis set by minimizing both the estimation error and the model complexity. To formulate an SRM scheme for IRL, we estimate the policy gradient from demonstrations to serve as the empirical risk, and establish an upper bound on the Rademacher complexity of the hypothesis classes to serve as the model penalty. We further present a learning guarantee. In particular, we provide an explicit SRM scheme for the common linear weighted sum setting in IRL. Simulations demonstrate the performance and efficiency of our scheme.
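For context, a generic SRM criterion of the kind invoked here selects the hypothesis class that balances empirical risk against a complexity penalty. This is a standard statistical-learning form; the constants and the exact bound used in the paper may differ:

```latex
\mathcal{H}_{k^{\ast}} \;=\; \arg\min_{k}\; \min_{h \in \mathcal{H}_k}
\left[ \widehat{R}_n(h) \;+\; 2\,\widehat{\mathfrak{R}}_n(\mathcal{H}_k) \;+\; c\,\sqrt{\frac{\log(1/\delta)}{n}} \right],
```

where $\widehat{R}_n$ is the empirical risk (here built from policy gradients estimated from demonstrations) and $\widehat{\mathfrak{R}}_n(\mathcal{H}_k)$ is the empirical Rademacher complexity acting as the model penalty.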
Multiplayer Homicidal Chauffeur Reach-Avoid Games: A Pursuit Enclosure Function Approach
Yan, Rui, Duan, Xiaoming, Zou, Rui, He, Xin, Shi, Zongying, Bullo, Francesco
This paper presents a multiplayer Homicidal Chauffeur reach-avoid differential game involving Dubins-car pursuers and simple-motion evaders. The goal of the pursuers is to cooperatively protect a planar convex region from the evaders, who strive to reach the region. We propose a cooperative strategy for the pursuers based on subgames for multiple pursuers against one evader and on optimal task allocation. We introduce pursuit enclosure functions (PEFs) and propose a new enclosure region pursuit (ERP) winning approach that supports forward analysis for strategy synthesis in the subgames. We show that if a pursuit coalition is able to defend the region against an evader under ERP winning, then at most two pursuers in the coalition are needed. We also propose a steer-to-ERP approach to certify ERP winning and synthesize the ERP winning strategy. To implement the strategy, we introduce a positional PEF and provide the parameters, states, and strategies that ensure ERP winning for both one pursuer and two pursuers against one evader. Additionally, we formulate a binary integer program using the subgame outcomes to maximize the number of captured evaders in the pursuit task allocation. Finally, we propose a multiplayer receding-horizon strategy in which ERP winning is checked in each horizon, the task is allocated, and the strategies of the pursuers are determined. Numerical examples are provided to illustrate the results.
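As an illustration of the task-allocation step, a binary integer program of the kind described can be sketched as follows, where $x_{ce}\in\{0,1\}$ assigns pursuit coalition $c$ to evader $e$ and $w_{ce}=1$ if the coalition wins the ERP subgame against that evader. The notation here is ours, not the paper's:

```latex
\max_{x}\; \sum_{c}\sum_{e} w_{ce}\, x_{ce}
\quad \text{s.t.} \quad
\sum_{c} x_{ce} \le 1 \;\;\forall e,
\qquad
\sum_{e} \sum_{c \ni i} x_{ce} \le 1 \;\;\forall i,
\qquad
x_{ce} \in \{0,1\},
```

so that each evader is assigned to at most one coalition and each pursuer $i$ belongs to at most one active coalition.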
Affordance-Driven Next-Best-View Planning for Robotic Grasping
Zhang, Xuechao, Wang, Dong, Han, Sun, Li, Weichuang, Zhao, Bin, Wang, Zhigang, Duan, Xiaoming, Fang, Chongrong, Li, Xuelong, He, Jianping
Grasping occluded objects in cluttered environments is an essential component of complex robotic manipulation tasks. In this paper, we introduce an AffordanCE-driven Next-Best-View planning policy (ACE-NBV) that tries to find a feasible grasp for the target object by continuously observing the scene from new viewpoints. This policy is motivated by the observation that the grasp affordances of an occluded object can be measured better when the observation view direction coincides with the grasp direction. Specifically, our method leverages novel view imagery to predict grasp affordances under previously unobserved views, and selects the next observation view based on the highest imagined grasp quality of the target object. Experimental results in simulation and on a real robot demonstrate the effectiveness of the proposed affordance-driven next-best-view planning policy. Project page: https://sszxc.net/ace-nbv/.
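A minimal sketch of the view-selection idea is shown below; the function names and the scoring model are placeholders of our own, not the ACE-NBV implementation, which uses learned affordance and novel-view-imagery networks.

```python
# Minimal sketch of affordance-driven next-best-view selection (illustrative only;
# function names and the scoring model are assumptions, not the ACE-NBV code).
import numpy as np

def imagined_grasp_quality(scene_latent: np.ndarray, view_direction: np.ndarray) -> float:
    """Stand-in for a network that imagines the target object from `view_direction`
    and predicts the best grasp quality achievable under that view."""
    # Placeholder score; in the real system this comes from a learned affordance model.
    preferred = np.array([0.0, 1.0, 0.5])
    return float(view_direction @ preferred /
                 (np.linalg.norm(view_direction) * np.linalg.norm(preferred)))

def select_next_best_view(scene_latent: np.ndarray, candidate_views: list) -> np.ndarray:
    """Pick the candidate viewpoint whose imagined grasp quality is highest."""
    scores = [imagined_grasp_quality(scene_latent, v) for v in candidate_views]
    return candidate_views[int(np.argmax(scores))]

if __name__ == "__main__":
    latent = np.zeros(128)  # encoded observation of the cluttered scene
    angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
    views = [np.array([np.cos(a), np.sin(a), 0.5]) for a in angles]
    print("next view direction:", select_next_best_view(latent, views))
```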
HiCRISP: A Hierarchical Closed-Loop Robotic Intelligent Self-Correction Planner
Ming, Chenlin, Lin, Jiacheng, Fong, Pangkit, Wang, Han, Duan, Xiaoming, He, Jianping
The integration of Large Language Models (LLMs) into robotics has revolutionized human-robot interactions and autonomous task planning. However, these systems are often unable to self-correct during task execution, which hinders their adaptability in dynamic real-world environments. To address this issue, we present a Hierarchical Closed-loop Robotic Intelligent Self-correction Planner (HiCRISP), an innovative framework that enables robots to correct errors within individual steps during task execution. HiCRISP actively monitors and adapts the task execution process, addressing both high-level planning and low-level action errors.
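The closed-loop, step-level correction idea can be sketched as below; this is an illustrative skeleton with stand-in functions, whereas the actual HiCRISP framework drives planning and correction with LLMs and richer error handling.

```python
# Minimal sketch of a hierarchical closed-loop self-correction loop (illustrative only;
# `high_level_plan`, `execute`, and `correct` are hypothetical stand-ins).
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    attempts: int = 0

def high_level_plan(task: str) -> list:
    # Stand-in for a planner that decomposes the task into executable steps.
    return [Step("locate object"), Step("grasp object"), Step("place object")]

def execute(step: Step) -> bool:
    # Stand-in for low-level skill execution; returns success/failure.
    step.attempts += 1
    return step.attempts >= 2  # pretend each step succeeds on the second try

def correct(step: Step) -> Step:
    # Stand-in for step-level self-correction (e.g., re-prompting, re-grasping).
    return step

def run(task: str, max_retries: int = 3) -> bool:
    for step in high_level_plan(task):
        while not execute(step):
            if step.attempts >= max_retries:
                return False          # escalate: replan at the high level in a fuller system
            step = correct(step)      # low-level correction within the current step
    return True

if __name__ == "__main__":
    print("task succeeded:", run("tidy the table"))
```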
Learning-Based Motion Planning with Mixture Density Networks
Wang, Yinghan, Duan, Xiaoming, He, Jianping
The trade-off between computation time and path optimality is a key consideration in motion planning algorithms. While classical sampling-based algorithms fall short in computational efficiency for high-dimensional planning, learning-based methods have shown great potential in achieving time-efficient and optimal motion planning. State-of-the-art learning-based motion planning algorithms use paths generated by sampling-based methods as expert supervision data and train networks via regression techniques. However, these methods often overlook the important multimodal property of the optimal paths in the training set, making them incapable of finding good paths in some scenarios. In this paper, we propose a Multimodal Neuron Planner (MNP) based on mixture density networks that explicitly takes into account the multimodality of the training data and simultaneously achieves time efficiency and path optimality. For environments represented by a point cloud, MNP first efficiently compresses the point cloud into a latent vector using encoding networks suitable for processing point clouds. We then design multimodal planning networks that enable MNP to learn and predict multiple optimal solutions. Simulation results show that our method outperforms the state-of-the-art learning-based method MPNet and the advanced sampling-based methods IRRT* and BIT*.
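For readers unfamiliar with mixture density networks, the sketch below shows the kind of MDN output head and negative log-likelihood loss this approach builds on. The layer sizes and the encoder are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch of a mixture density network (MDN) output head (illustrative only;
# dimensions and the upstream point-cloud encoder are placeholders).
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, n_components: int):
        super().__init__()
        self.k, self.d = n_components, out_dim
        self.pi = nn.Linear(in_dim, n_components)                    # mixture weights
        self.mu = nn.Linear(in_dim, n_components * out_dim)          # component means
        self.log_sigma = nn.Linear(in_dim, n_components * out_dim)   # per-dim log std

    def forward(self, h):
        log_pi = torch.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.k, self.d)
        sigma = torch.exp(self.log_sigma(h)).view(-1, self.k, self.d)
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, target):
    """Negative log-likelihood of `target` under the predicted Gaussian mixture."""
    dist = torch.distributions.Normal(mu, sigma)
    log_prob = dist.log_prob(target.unsqueeze(1)).sum(dim=-1)        # (batch, k)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()

if __name__ == "__main__":
    head = MDNHead(in_dim=64, out_dim=2, n_components=5)             # e.g., 2-D next waypoints
    h = torch.randn(8, 64)                                           # latent from an encoder
    target = torch.randn(8, 2)                                       # expert waypoint labels
    loss = mdn_nll(*head(h), target)
    loss.backward()
    print("MDN NLL:", float(loss))
```

Predicting a full mixture rather than a single regressed point is what lets the planner represent several distinct optimal paths for the same scene.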
Control Input Inference of Mobile Agents under Unknown Objective
Qu, Chendi, He, Jianping, Duan, Xiaoming, Wu, Shukun
Trajectory and control secrecy is an important issue in robotics security. This paper proposes a novel algorithm for inferring the control input of a mobile agent without knowing its control objective. Specifically, the algorithm first estimates the target state by applying external perturbations. We then identify the objective function based on inverse optimal control, providing a well-posedness proof and an identifiability analysis. Next, we obtain the optimal estimate of the control horizon using binary search. Finally, the agent's control optimization problem is reconstructed and solved to predict its input. Simulations illustrate the efficiency and performance of the algorithm.
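The horizon-estimation step can be sketched as a binary search over candidate horizons, assuming the fit error is unimodal in the horizon length; `fit_error` below is a hypothetical stand-in for re-solving the reconstructed optimal control problem and scoring it against the observed trajectory.

```python
# Minimal sketch of binary search over the control horizon (illustrative only).
def fit_error(horizon: int) -> float:
    # Stand-in: in the real pipeline this solves the reconstructed optimization
    # with the candidate horizon and returns the trajectory prediction error.
    true_horizon = 17
    return abs(horizon - true_horizon)

def estimate_horizon(lo: int, hi: int) -> int:
    """Binary search assuming the fit error is unimodal in the horizon length."""
    while lo < hi:
        mid = (lo + hi) // 2
        if fit_error(mid) <= fit_error(mid + 1):
            hi = mid          # minimum lies at mid or to its left
        else:
            lo = mid + 1      # minimum lies to the right of mid
    return lo

if __name__ == "__main__":
    print("estimated control horizon:", estimate_horizon(1, 100))
```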
Reinforcement Learning with Temporal-Logic-Based Causal Diagrams
Paliwal, Yash, Roy, Rajarshi, Gaglione, Jean-Raphaël, Baharisangari, Nasim, Neider, Daniel, Duan, Xiaoming, Topcu, Ufuk, Xu, Zhe
We study a class of reinforcement learning (RL) tasks where the objective of the agent is to accomplish temporally extended goals. In this setting, a common approach is to represent the tasks as deterministic finite automata (DFA) and integrate them into the state space of the RL algorithm. However, while these machines model the reward function, they often overlook the causal knowledge about the environment. To address this limitation, we propose the Temporal-Logic-based Causal Diagram (TL-CD) in RL, which captures the temporal causal relationships between different properties of the environment. We exploit the TL-CD to devise an RL algorithm in which the agent requires significantly less exploration of the environment. To this end, based on a TL-CD and a task DFA, we identify configurations where the agent can determine the expected rewards early during exploration. Through a series of case studies, we demonstrate the benefits of using TL-CDs, in particular the faster convergence of the algorithm to an optimal policy due to reduced exploration of the environment.
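To make the DFA-in-the-state-space idea concrete, the toy below tracks a task automaton alongside the environment labels, which is the product construction an RL agent would learn over. The TL-CD mechanism itself, which marks product states whose eventual reward is already causally determined so exploration can be cut short, is only described in the comments and not reproduced here.

```python
# Minimal sketch of tracking a task DFA alongside environment labels (illustrative only;
# the toy task and labels are assumptions, not the paper's case studies).
from dataclasses import dataclass

@dataclass(frozen=True)
class DFA:
    # Toy task: "observe label 'a', then label 'b'", states 0 -> 1 -> 2 (accepting).
    initial: int = 0
    accepting_state: int = 2

    def step(self, q: int, label: str) -> int:
        if q == 0 and label == "a":
            return 1
        if q == 1 and label == "b":
            return 2
        return q

    def accepting(self, q: int) -> bool:
        return q == self.accepting_state

if __name__ == "__main__":
    dfa = DFA()
    q = dfa.initial
    for label in ["c", "a", "c", "b"]:      # labels emitted by the environment at each step
        q = dfa.step(q, label)
        reward = 1.0 if dfa.accepting(q) else 0.0
        # An RL agent learns over the product state (env_state, q); a TL-CD can mark
        # product states whose eventual reward is already causally determined,
        # letting the agent terminate exploration early there.
        print(label, "->", q, "reward:", reward)
```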
Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning
Pan, Haoxuan, Ye, Deheng, Duan, Xiaoming, Fu, Qiang, Yang, Wei, He, Jianping, Sun, Mingfei
We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from the Deep Reinforcement Learning (DRL) perspective. The objective is formulated theoretically as the expected return discounted over the time horizon. One of the major policy gradient biases is the state distribution shift: the state distribution used to estimate the gradients differs from the theoretical formulation in that it does not take into account the discount factor. Existing discussions of the influence of this bias are limited to the tabular and softmax cases in the literature. Therefore, in this paper, we extend the analysis to the DRL setting where the policy is parameterized, and demonstrate theoretically how this bias can lead to suboptimal policies. We then discuss why empirically inaccurate implementations with a shifted state distribution can still be effective. We show that, despite such state distribution shift, the policy gradient estimation bias can be reduced in the following three ways: 1) a small learning rate; 2) an adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically, we show that a smaller learning rate, or an adaptive learning rate such as that used by the Adam and RMSProp optimizers, makes the policy optimization robust to the bias. We further draw connections between optimizers and optimization regularization to show that both the KL and the reverse KL regularization can significantly rectify this bias. Moreover, we provide extensive experiments on continuous control tasks to support our analysis. Our paper sheds light on how successful policy gradient algorithms optimize policies in the DRL setting, and contributes insights into practical issues in DRL.
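As a concrete example of the third mitigation, a KL-regularized policy-gradient surrogate can be written as below. The penalty coefficient and the choice of KL direction are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of a KL-regularized policy-gradient surrogate loss (illustrative only).
import torch

def pg_loss_with_kl(log_probs, old_log_probs, advantages, kl_coef=0.01):
    """Vanilla policy-gradient surrogate plus a KL penalty toward the behavior policy,
    one of the mechanisms identified as reducing the estimation bias."""
    pg_term = -(log_probs * advantages).mean()
    # Monte Carlo estimate of KL(old || new) from actions sampled under the old policy.
    kl_term = (old_log_probs - log_probs).mean()
    return pg_term + kl_coef * kl_term

if __name__ == "__main__":
    torch.manual_seed(0)
    logits_new = torch.randn(32, 4, requires_grad=True)
    actions = torch.randint(0, 4, (32,))
    log_probs = torch.log_softmax(logits_new, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    logits_old = logits_new.detach() + 0.1 * torch.randn(32, 4)
    old_log_probs = torch.log_softmax(logits_old, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    advantages = torch.randn(32)
    loss = pg_loss_with_kl(log_probs, old_log_probs, advantages)
    loss.backward()
    print("surrogate loss:", float(loss))
```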
Robust Pandemic Control Synthesis with Formal Specifications: A Case Study on COVID-19 Pandemic
Xu, Zhe, Duan, Xiaoming
Pandemics can bring a range of devastating consequences to public health and the world economy. Identifying the most effective control strategies has been an imperative task around the world. Various public health control strategies have been proposed and tested against pandemic diseases (e.g., COVID-19). We study two specific pandemic control models: the susceptible, exposed, infectious, recovered (SEIR) model with vaccination control, and the SEIR model with shield immunity control. We express the pandemic control requirements as metric temporal logic (MTL) formulas. We then develop an iterative approach for synthesizing the optimal control strategies subject to the MTL specifications. We provide simulation results in two different scenarios for robust control of the COVID-19 pandemic: one for vaccination control and another for shield immunity control, with the model parameters estimated from data in Lombardy, Italy. The results show that the proposed synthesis approach can generate control inputs such that the time-varying numbers of individuals in each category (e.g., infectious, immune) satisfy the MTL specifications with robustness against initial state and parameter uncertainties.
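The sketch below simulates a standard SEIR model with a vaccination control input and checks a simple always-below-threshold requirement on the infectious population, in the spirit of an MTL specification such as G_[0,150] (I(t) <= 1000). The parameter values are placeholders, not the Lombardy estimates used in the paper.

```python
# Minimal sketch of SEIR dynamics with vaccination control (illustrative parameters only).
import numpy as np

def seir_step(state, u, beta=0.3, sigma=0.2, gamma=0.1, dt=1.0):
    """One Euler step of SEIR dynamics; `u` is the vaccination rate applied to susceptibles."""
    S, E, I, R = state
    N = S + E + I + R
    dS = -beta * S * I / N - u * S
    dE = beta * S * I / N - sigma * E
    dI = sigma * E - gamma * I
    dR = gamma * I + u * S
    return state + dt * np.array([dS, dE, dI, dR])

if __name__ == "__main__":
    state = np.array([9990.0, 5.0, 5.0, 0.0])
    u_schedule = [0.0] * 30 + [0.01] * 120       # candidate vaccination control sequence
    infectious = []
    for u in u_schedule:
        state = seir_step(state, u)
        infectious.append(state[2])
    # Check the always-below-threshold requirement on the simulated trajectory.
    print("peak infectious:", round(max(infectious), 1),
          "spec satisfied:", max(infectious) <= 1000.0)
```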