Exploring TD error as a heuristic for $\sigma$ selection in Q($\sigma$, $\lambda$)
Among temporal-difference (TD) algorithms, Q(σ, λ) performs multi-step backups online while unifying sampling with taking the expectation over all actions in a state. Rather than holding σ constant or varying it on a time-based schedule, its value can be selected from characteristics of the current state. This project explores the viability of such a scheme based on the TD error.

Introduction

While having several dimensions of generality in an algorithm can be a powerful tool, in most cases it comes with the burden of manually selecting values along those dimensions, commonly referred to as hyper-parameter selection. An ideal learning algorithm would be completely general, to the point of not needing a fixed set of hyper-parameters tuned to perform optimally on a given problem. In Q(σ, λ), the σ parameter gives us the flexibility to adjust the proportion of sampling and expectation in our updates. Although σ serves as a hyper-parameter, De Asis, Hernandez-Garcia, Holland and Sutton (2018) found that, atypically, a constant value of σ did not give the best performance. They used a Dynamic Decay σ scheme for n-step Q(σ), in which the value of σ was reduced by a factor of 0.95 after every episode.
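To make the role of σ concrete, here is a minimal Python sketch of a one-step Q(σ) backup target together with a per-episode decay schedule. It assumes the standard one-step form in which σ linearly mixes the sampled next action value with the expectation under the target policy. The function names, the starting value of 1.0 for σ, and the reading of "reduced by a factor of 0.95" as multiplication by 0.95 are assumptions made for illustration, not details taken from the text.

```python
def q_sigma_target(reward, gamma, sigma, q_next, pi_next, next_action):
    """One-step Q(sigma) target (illustrative sketch).

    sigma = 1 gives the fully sampled (Sarsa-style) target,
    sigma = 0 gives the full expectation (Expected Sarsa-style) target.
    q_next and pi_next are per-action lists for the next state.
    """
    sampled = q_next[next_action]
    expected = sum(p * q for p, q in zip(pi_next, q_next))
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


def decayed_sigma(episode, sigma0=1.0, decay=0.95):
    """Sigma after `episode` completed episodes.

    Assumption: sigma starts at sigma0 and is multiplied by `decay`
    once per episode.
    """
    return sigma0 * decay ** episode


if __name__ == "__main__":
    q_next = [2.0, 4.0]   # hypothetical action values in the next state
    pi_next = [0.5, 0.5]  # hypothetical target-policy probabilities
    for episode in (0, 10, 50):
        sigma = decayed_sigma(episode)
        target = q_sigma_target(1.0, 0.5, sigma, q_next, pi_next, next_action=1)
        print(episode, sigma, target)
```

Under this schedule the target drifts from the sampled value toward the expected value as episodes accumulate. A TD-error based scheme would replace `decayed_sigma` with a rule that depends on the current state.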
Dec-21-2019