Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach

Oct-10-2024, 14:59:48 GMT–Neural Information Processing Systems

Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget, which the agent usually spends by collecting trajectories of fixed length within a simulator. However, is this data collection strategy the best option? To answer this question, we take as a quality index the variance of an unbiased policy return estimator that uses trajectories of different lengths, i.e., truncated trajectories. We first derive a closed-form expression of this variance that clearly shows the sub-optimality of the fixed-length trajectory schedule.
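To make the idea of a return estimator built from trajectories of different lengths concrete, here is a minimal Python sketch. It rests on stated assumptions and is not the authors' estimator or their adaptive schedule. It assumes a hypothetical simulator call `rollout(h)` that returns exactly the first h rewards of a fresh, independent trajectory. It also assumes that the truncation lengths are fixed before sampling and that the target is the discounted return up to the longest horizon. Each time step t is averaged only over the trajectories that reached it. This lets a schedule mix short and long trajectories under the same total budget `sum(lengths)`.

```python
import random
import statistics
from typing import Callable, List, Sequence


def truncated_mc_estimate(
    rollout: Callable[[int], List[float]],
    lengths: Sequence[int],
    gamma: float,
) -> float:
    """Estimate the discounted return from trajectories truncated at different lengths.

    `rollout(h)` is a hypothetical simulator call returning exactly the first h
    rewards of a fresh, independent trajectory. `lengths[i]` is the truncation
    horizon of trajectory i and must be fixed before sampling starts.
    """
    horizon = max(lengths)
    sums = [0.0] * horizon  # total reward observed at step t
    counts = [0] * horizon  # number of trajectories that reached step t
    for h in lengths:
        for t, r in enumerate(rollout(h)):
            sums[t] += r
            counts[t] += 1
    # Average step t only over the trajectories that reached it, then discount.
    # The longest trajectory reaches every step, so counts[t] >= 1 here.
    return sum(gamma ** t * sums[t] / counts[t] for t in range(horizon))


if __name__ == "__main__":
    rng = random.Random(0)

    def noisy_rollout(h: int) -> List[float]:
        # Hypothetical toy simulator: reward at every step is 1 plus N(0, 1) noise.
        return [rng.gauss(1.0, 1.0) for _ in range(h)]

    gamma = 0.5
    fixed = [10] * 10                      # budget 100: ten length-10 trajectories
    mixed = [10] * 5 + [5] * 6 + [2] * 10  # budget 100: 50 + 30 + 20 steps
    for name, schedule in [("fixed", fixed), ("mixed", mixed)]:
        estimates = [
            truncated_mc_estimate(noisy_rollout, schedule, gamma) for _ in range(2000)
        ]
        print(name, sum(schedule), statistics.mean(estimates), statistics.variance(estimates))
```

In this toy setting the per-step rewards are independent. The variance of the sketch is then the sum over t of gamma^(2t) / n_t, where n_t is the number of trajectories that reach step t. For gamma = 0.5 this comes to about 0.133 for the fixed schedule and about 0.067 for the mixed one. This illustrates, for this toy model only, why spending part of the budget on shorter trajectories can reduce variance.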

adaptive approach, Monte Carlo policy evaluation, truncating trajectory
