A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

Open in new window