 per-state uncertainty estimate


Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

Neural Information Processing Systems

We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.
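For reference, the two regression targets contrasted above can be written as follows. The notation is an assumption, not taken from the paper: $r_t$ is the reward at step $t$, $\gamma$ the discount factor, $T$ the episode length, and $\hat{V}$ the current value estimate.

```latex
\begin{align*}
G_t &= \sum_{k=0}^{T-t-1} \gamma^{k} r_{t+k} && \text{(MC target: full observed return)} \\
y_t &= r_t + \gamma \hat{V}(s_{t+1}) && \text{(TD(0) target: one reward plus a bootstrapped estimate)}
\end{align*}
```

The MC target uses no estimate and so does not inherit errors in $\hat{V}$, but it sums many random rewards; the TD target replaces that sum with $\hat{V}(s_{t+1})$, which lowers variance but passes any error in $\hat{V}$ on to the preceding state.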


Reviews: Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

Neural Information Processing Systems

The authors propose a novel method that adaptively chooses between Monte Carlo (MC) and temporal-difference (TD) targets for policy evaluation. The goal is to balance bias and variance in the reinforcement learning setting, and to this end they propose the Adaptive TD algorithm. The algorithm takes as input a set of sample episodes, which it uses to bootstrap confidence intervals for the value function at each state. It then compares the TD estimate for each state with the corresponding confidence interval: if the TD estimate falls inside the interval, it is kept; otherwise the algorithm uses the midpoint of the interval, on the assumption that the TD estimate is biased and inaccurate. The process repeats for a number of epochs, since the TD estimates change as the value estimate of the successor state is updated by the Adaptive TD rule. I think this paper shows promise: the method is, to my knowledge, original, and the numerical experiments suggest it achieves the target the authors set for it, namely dominating TD and MC in the worst case.
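The procedure summarized in this review can be sketched roughly as follows. This is a minimal tabular illustration, not the authors' implementation: the percentile bootstrap over per-state MC returns, the function names (`bootstrap_interval`, `adaptive_target`, `adaptive_td`), the learning rate, and the data layout are all assumptions made for the example.

```python
import numpy as np

def bootstrap_interval(returns, n_boot=200, alpha=0.05, rng=None):
    """Percentile-bootstrap interval for the mean MC return of one state (assumed scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    returns = np.asarray(returns, dtype=float)
    # Resample the observed returns with replacement and average each resample.
    idx = rng.integers(0, len(returns), size=(n_boot, len(returns)))
    means = returns[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def adaptive_target(td_target, lo, hi):
    """Keep the TD target if it lies inside [lo, hi]; otherwise use the interval midpoint."""
    if lo <= td_target <= hi:
        return td_target
    return 0.5 * (lo + hi)

def adaptive_td(transitions, mc_returns, n_states, gamma=0.99,
                epochs=20, lr=0.5, seed=0):
    """Hypothetical tabular loop.

    transitions: list of (state, reward, next_state, done) tuples.
    mc_returns:  dict mapping each state to a list of observed MC returns.
    """
    rng = np.random.default_rng(seed)
    # Intervals are computed once from the MC returns and then held fixed.
    intervals = {s: bootstrap_interval(g, rng=rng) for s, g in mc_returns.items()}
    v = np.zeros(n_states)
    for _ in range(epochs):
        for s, r, s_next, done in transitions:
            # One-step TD target built from the current value estimates.
            td = r + (0.0 if done else gamma * v[s_next])
            lo, hi = intervals[s]
            v[s] += lr * (adaptive_target(td, lo, hi) - v[s])
    return v
```

Repeating the sweep for several epochs matters because each TD target depends on the current estimate of the successor state, so a target rejected in an early epoch may fall inside its interval later.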


Reviews: Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

Neural Information Processing Systems

The argumentation defending the proposed approach and the numerical evaluation of its performance on realistic examples are convincing. Although the reviewers ultimately agree that NeurIPS might not be the best venue for this work because of the near-absence of a theoretical part, I recommend giving it a chance for the quality of the other dimensions of this work. If the paper is finally rejected, I recommend that the authors follow the reviewers' suggestions and either resubmit to a more specialized conference or consider a theoretical analysis (which can be expected to be rather involved).

