Reinforcement Learning
Variance Reduced Policy Evaluation with Smooth Function Approximation
Hoi-To Wai, Mingyi Hong, Zhuoran Yang, Zhaoran Wang, Kexin Tang
Policy evaluation with smooth and nonlinear function approximation has shown great potential for reinforcement learning. Compared to linear function approximation, it allows for a richer class of approximation functions, such as neural networks. Traditional algorithms are based on two-timescale stochastic approximation, whose convergence rate is often slow.
The authors would like to thank all three reviewers for their useful feedback and the area chair for handling this
To address the reviewers' comments, upon acceptance of this paper, we will (i) include numerical experiment Some common concerns are as follows. Details of this experiment can be found in the final version. Reviewer 1: We thank the reviewer for providing constructive and supportive comments. They will be corrected in the final version. Details will be provided in the final version.
The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning -- Supplementary Material -- A Tabular Experiments
Here, we discuss some additional settings for the tabular experiments. The reason for this is that Sarsa(0.95), in contrast to MB-VI and MB-SU, is a multi-step method. Therefore, there is stochasticity in the update target even in deterministic environments due to exploration of the behavior policy. All methods used optimistic initialization. The pseudocode of the tabular, on-policy method used in Section 5.1 is shown in Algorithm 1. These estimates are updated at the end of the episode, using the data gathered during the episode.
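The point about stochastic update targets can be illustrated with a minimal sketch of tabular Sarsa(λ) with replacing traces, ε-greedy exploration, and optimistic initialization. This is not the paper's Algorithm 1. The chain environment (`make_chain`), the function names, and all hyperparameter values other than λ = 0.95 are assumptions chosen for illustration. Although the environment is deterministic, the target `r + gamma * Q[s'][a']` depends on the action `a'` sampled by the exploring behavior policy, so the same state-action pair can receive different targets.

```python
import random


def make_chain(n_states):
    """Hypothetical deterministic chain: action 0 moves left, action 1 moves right.
    Reaching the last state ends the episode with reward 1."""
    goal = n_states - 1

    def step(state, action):
        nxt = max(state - 1, 0) if action == 0 else state + 1
        reward = 1.0 if nxt == goal else 0.0
        return nxt, reward

    def is_terminal(state):
        return state == goal

    return step, is_terminal


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a uniform random action, else the first greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))


def sarsa_lambda_episode(Q, step, is_terminal, start, alpha, gamma, lam,
                         epsilon, rng, max_steps=1000):
    """Run one episode of Sarsa(lambda) with replacing traces, updating Q in place.
    Returns the list of (state, action, one-step target) triples seen."""
    traces = {}
    targets = []
    s = start
    a = epsilon_greedy(Q[s], epsilon, rng)
    for _ in range(max_steps):
        s2, r = step(s, a)
        if is_terminal(s2):
            a2 = None
            target = r
        else:
            # The next action is sampled from the exploring policy, which makes
            # the target random even though the transition itself is deterministic.
            a2 = epsilon_greedy(Q[s2], epsilon, rng)
            target = r + gamma * Q[s2][a2]
        targets.append((s, a, target))
        delta = target - Q[s][a]
        traces[(s, a)] = 1.0  # replacing trace
        for key, e in list(traces.items()):
            Q[key[0]][key[1]] += alpha * delta * e
            traces[key] = gamma * lam * e
        if a2 is None:
            break
        s, a = s2, a2
    return targets


if __name__ == "__main__":
    rng = random.Random(0)
    n_states, n_actions = 5, 2
    step, is_terminal = make_chain(n_states)
    # Optimistic initialization: 1.0 is the largest return this chain can give.
    Q = [[1.0] * n_actions for _ in range(n_states)]
    seen = []
    for _ in range(20):
        seen += sarsa_lambda_episode(Q, step, is_terminal, start=0, alpha=0.1,
                                     gamma=0.9, lam=0.95, epsilon=0.1, rng=rng)
    # Targets observed for "move right from state 0" across visits.
    print(sorted({round(t, 4) for s, a, t in seen if (s, a) == (0, 1)}))
```

In this sketch the targets also drift because `Q` changes between visits. Setting `epsilon=0.0` removes the exploration-induced part, which makes it possible to separate the two effects.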