Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

Tengyu Xu, Shaofeng Zou, Yingbin Liang

Neural Information Processing Systems 

Neural Information Processing Systems http://nips.cc/