A Appendix: Learning Guidance Rewards with Trajectory-space Smoothing

A.1 Monte-Carlo Estimate of the Guidance Rewards

Neural Information Processing Systems 

Let Z^π(s, a) be the random variable denoting the sum of discounted rewards along a trajectory starting with the state-action pair (s, a).
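
To make the definition concrete, the following minimal sketch computes one sample of the discounted return from a trajectory's reward sequence and averages several samples as a plain Monte-Carlo estimate of its mean. The discount factor `gamma`, the function names, and the `sample_rewards` callback (standing in for a rollout that starts at a fixed (s, a) and follows π) are assumptions introduced for illustration; this is not the authors' estimator of the guidance rewards.

```python
import random


def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a single trajectory."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def monte_carlo_mean(sample_rewards, gamma, num_rollouts, seed=0):
    """Average the discounted return over independent rollouts.

    sample_rewards is a hypothetical callback: given a random.Random
    instance, it returns the reward sequence of one trajectory that
    starts from a fixed state-action pair and then follows the policy.
    """
    rng = random.Random(seed)
    returns = [
        discounted_return(sample_rewards(rng), gamma)
        for _ in range(num_rollouts)
    ]
    return sum(returns) / len(returns)
```

Each call to `discounted_return` yields one realization of the random variable; `monte_carlo_mean` estimates its expectation, and the spread of the individual returns reflects the randomness in the trajectories.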
