O$^2$TD: (Near)-Optimal Off-Policy TD Learning

Bo Liu, Daoming Lyu, Wen Dong, Saad Biaz

Apr-19-2017–arXiv.org Machine Learning

Temporal difference learning and Residual Gradient methods are the most widely used temporal difference based learning algorithms; however, it has been shown that neither of their objective functions is optimal with respect to approximating the true value function V. Two novel algorithms are proposed to approximate the true value function V. This paper makes the following contributions:
- A batch algorithm that can help find the approximate optimal off-policy prediction of the true value function V.
- A near-optimal algorithm with linear per-step computational cost that can learn from a collection of off-policy samples.
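
For orientation, the two baseline methods named in the abstract can be sketched with linear value-function approximation, V(s) ≈ theta · phi(s). This is a minimal illustration of the standard TD(0) and Residual Gradient updates, not the algorithms proposed in the paper. The function names, the feature vectors, and the importance-sampling ratio `rho` (target-policy probability divided by behavior-policy probability) are assumptions introduced here for the off-policy setting.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def td_error(theta, phi, reward, phi_next, gamma):
    """One-step TD error: r + gamma * V(s') - V(s)."""
    return reward + gamma * dot(theta, phi_next) - dot(theta, phi)


def td0_update(theta, phi, reward, phi_next, gamma, alpha, rho=1.0):
    """Semi-gradient TD(0) step: moves theta along phi(s) only.

    rho is a hypothetical per-sample importance-sampling ratio;
    rho = 1.0 recovers the on-policy update.
    """
    delta = td_error(theta, phi, reward, phi_next, gamma)
    return [t + alpha * rho * delta * f for t, f in zip(theta, phi)]


def residual_gradient_update(theta, phi, reward, phi_next, gamma, alpha, rho=1.0):
    """Residual Gradient step: gradient descent on 0.5 * delta**2.

    The gradient of delta w.r.t. theta is (gamma * phi(s') - phi(s)),
    so the descent direction is delta * (phi(s) - gamma * phi(s')).
    With stochastic transitions, a single sampled s' gives a biased
    estimate of this gradient (the double-sampling issue).
    """
    delta = td_error(theta, phi, reward, phi_next, gamma)
    return [
        t + alpha * rho * delta * (f - gamma * fn)
        for t, f, fn in zip(theta, phi, phi_next)
    ]


if __name__ == "__main__":
    theta = [0.0, 0.0]
    phi, phi_next = [1.0, 0.0], [0.0, 1.0]
    print(td0_update(theta, phi, 1.0, phi_next, gamma=0.5, alpha=0.1))
    print(residual_gradient_update(theta, phi, 1.0, phi_next, gamma=0.5, alpha=0.1))
```

The two updates share the same TD error and differ only in the direction applied to theta: TD(0) ignores the dependence of the bootstrapped target on theta, while Residual Gradient differentiates through it.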

artificial intelligence, machine learning, reinforcement learning

Country:
- North America > United States (0.46)

Genre:
- Research Report (0.65)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)