A More Discussion
Learning the value function of $\pi$ requires off-policy evaluation of $\pi$ (i.e., learning $Q^\pi$ or $V^\pi$), which is prone to distribution shift. This is suboptimal, especially when the offline dataset does not contain a single complete path from the start location to the goal location. How can we go beyond action stitching? The execute-policy is in fact an inverse dynamics model, which has been used in various ways in sequential decision-making. In imitation learning, [49] and [33] train an inverse dynamics model to label state-only demonstrations with inferred actions.
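To make this usage concrete, the following is a minimal sketch of inverse-dynamics relabeling: a model $f(s_t, s_{t+1}) \to a_t$ is fit on action-labeled transitions and then used to infer actions for a state-only demonstration. All names (`InverseDynamics`, `train_idm`, `relabel_demo`), the architecture, and the continuous-action MSE objective are illustrative assumptions, not the specific setups of [49] or [33].

```python
# Minimal sketch of inverse-dynamics relabeling (hypothetical names;
# assumes continuous actions and an MSE regression objective).
import torch
import torch.nn as nn


class InverseDynamics(nn.Module):
    """Predict the action a_t that transitions s_t to s_{t+1}."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, s_next], dim=-1))


def train_idm(model, s, a, s_next, epochs: int = 100, lr: float = 3e-4):
    """Fit the model on action-labeled (s, a, s') transitions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(model(s, s_next), a)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


@torch.no_grad()
def relabel_demo(model, states: torch.Tensor) -> torch.Tensor:
    """Infer actions for a state-only demonstration s_0, ..., s_T."""
    return model(states[:-1], states[1:])


# Usage with synthetic data (shapes only; no real environment assumed).
state_dim, action_dim = 4, 2
model = InverseDynamics(state_dim, action_dim)
s, a, s_next = (torch.randn(1024, state_dim),
                torch.randn(1024, action_dim),
                torch.randn(1024, state_dim))
train_idm(model, s, a, s_next)
demo = torch.randn(50, state_dim)             # state-only demonstration
inferred_actions = relabel_demo(model, demo)  # shape (49, action_dim)
```

For discrete action spaces, the regression head and MSE loss would be replaced by logits over actions and a cross-entropy objective; the relabeling step is otherwise unchanged.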