A More Discussion

Neural Information Processing Systems 

Learning the value function of π requires off-policy evaluation of π (i.e., learning Q^π or V^π), which is prone to distribution shift. This is suboptimal, especially when the offline dataset does not contain a single complete path from the start location to the goal location. How can we go beyond action-stitching? The execute-policy is actually an inverse dynamics model, which has been used in various ways in sequential decision-making. In imitation learning, [49] and [33] train an inverse dynamics model to label state-only demonstrations with inferred actions.
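To make the last point concrete, the following is a minimal sketch (not the cited papers' implementation) of the inverse-dynamics idea: fit a model that predicts the action a from a state pair (s, s'), then use it to label a state-only demonstration with inferred actions. The linear toy dynamics, the feature choice [s, s'], and the least-squares fit are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear dynamics: s' = s + B a. B is unknown to the learner and
# is only seen through sampled transitions (s, a, s').
dim_s, dim_a, n = 4, 2, 512
B = rng.normal(size=(dim_s, dim_a))
S = rng.normal(size=(n, dim_s))
A = rng.normal(size=(n, dim_a))
S_next = S + A @ B.T

# Inverse dynamics model: regress a on the features [s, s'] by least squares.
X = np.concatenate([S, S_next], axis=1)
W, *_ = np.linalg.lstsq(X, A, rcond=None)

def infer_actions(states):
    """Label consecutive state pairs of a state-only demo with actions."""
    s, s_next = states[:-1], states[1:]
    return np.concatenate([s, s_next], axis=1) @ W

# A state-only demonstration generated by actions the learner never sees.
true_a = rng.normal(size=(10, dim_a))
demo = [rng.normal(size=dim_s)]
for a in true_a:
    demo.append(demo[-1] + B @ a)
demo = np.stack(demo)

pred_a = infer_actions(demo)
```

Here the inferred actions match the hidden ones almost exactly because the dynamics are linear and invertible in a; in the imitation-learning setting the same labeled pairs (s, â) would then be used for behavior cloning.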
