A More Discussion
Learning the value function of $\pi$ requires off-policy evaluation of $\pi$ (i.e., learning $Q^\pi$ or $V^\pi$), which is prone to distribution shift. This is suboptimal, especially when the offline dataset does not contain a single complete path from the start location to the goal location. How can we go beyond action stitching? The execute-policy is in fact an inverse dynamics model, which has been used in various ways in sequential decision-making. In imitation learning, [49] and [33] train an inverse dynamics model to label state-only demonstrations with inferred actions.
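To make this usage concrete, the following is a minimal sketch of inverse-dynamics relabeling: a model $f(s_t, s_{t+1}) \to a_t$ is fit on action-labeled transitions and then used to infer actions for a state-only demonstration. All names (`InverseDynamics`, `train_idm`, `relabel_demo`), the architecture, and the continuous-action MSE objective are illustrative assumptions, not the specific setups of [49] or [33].

```python
# Minimal sketch of inverse-dynamics relabeling (hypothetical names;
# assumes continuous actions and an MSE regression objective).
import torch
import torch.nn as nn


class InverseDynamics(nn.Module):
    """Predict the action a_t that transitions s_t to s_{t+1}."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, s_next], dim=-1))


def train_idm(model, s, a, s_next, epochs: int = 100, lr: float = 3e-4):
    """Fit the model on action-labeled (s, a, s') transitions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(model(s, s_next), a)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


@torch.no_grad()
def relabel_demo(model, states: torch.Tensor) -> torch.Tensor:
    """Infer actions for a state-only demonstration s_0, ..., s_T."""
    return model(states[:-1], states[1:])


# Usage with synthetic data (shapes only; no real environment assumed).
state_dim, action_dim = 4, 2
model = InverseDynamics(state_dim, action_dim)
s, a, s_next = (torch.randn(1024, state_dim),
                torch.randn(1024, action_dim),
                torch.randn(1024, state_dim))
train_idm(model, s, a, s_next)
demo = torch.randn(50, state_dim)             # state-only demonstration
inferred_actions = relabel_demo(model, demo)  # shape (49, action_dim)
```

For discrete action spaces, the regression head and MSE loss would be replaced by logits over actions and a cross-entropy objective; the relabeling step is otherwise unchanged.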