TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint
Lin, Haotian, Wang, Pengcheng, Schneider, Jeff, Shi, Guanya
Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data generation policy, which is always bootstrapped by the planner, and the learned policy prior. To mitigate such a mismatch in a minimalist way, we propose a policy regularization term reducing out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimal changes on top of existing frameworks and requires no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins, particularly in 61-DoF humanoid tasks.

… in TD-MPC implementation leads to persistent value overestimation. It is also empirically observed that the performance of TD-MPC2 is far from satisfactory on some high-dimensional locomotion tasks [33]. This phenomenon is closely connected to, yet distinct from, the well-known overestimation bias arising from function approximation errors and error accumulation in temporal difference learning [39, 37, 7]. More precisely, we identify the underlying issue as policy mismatch. The behavior policy generated by the MPC planner governs data collection, creating a buffered data distribution that does not directly align with the learned value or policy prior.
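As a rough sketch of what such a policy regularization term could look like (an illustration only, not the paper's exact formulation; the symbols $\pi_\phi$ for the learned policy prior, $Q_\theta$ for the value estimate, $a^{\mathrm{plan}}$ for the planner action stored in the replay buffer $\mathcal{B}$, and the weight $\lambda$ are notation assumed here), one can penalize the prior's deviation from the planner's behavior:

$$
\mathcal{L}_{\mathrm{prior}}(\phi) \;=\; \mathbb{E}_{(s,\,a^{\mathrm{plan}})\sim\mathcal{B}}\Big[-\,Q_\theta\big(s,\pi_\phi(s)\big) \;+\; \lambda\,\big\lVert \pi_\phi(s) - a^{\mathrm{plan}} \big\rVert_2^2\Big].
$$

Under this reading, keeping $\pi_\phi$ close to the buffered planner actions limits the out-of-distribution actions queried when bootstrapping TD targets, which is the mechanism by which the constraint is meant to curb value overestimation.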
arXiv.org Artificial Intelligence
Feb-5-2025