Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples
Tengyu Xu, Shaofeng Zou, Yingbin Liang
Neural Information Processing Systems
Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under independent and identically distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample model.
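To make the two time-scale structure concrete, below is a minimal sketch of the TDC update rules with linear function approximation. The main weights `theta` are updated with a small step size and a gradient-correction term, while the auxiliary weights `w` track a projection of the TD error on a faster time scale. The step sizes, feature dimension, and the random-feature trajectory are illustrative assumptions, not the paper's experimental setup; the importance ratio `rho` is set to 1 here for simplicity.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, rho,
             gamma=0.99, alpha=0.01, beta=0.1):
    """One two-time-scale TDC update (illustrative sketch).

    theta: value-function weights, slow time scale (step size alpha)
    w:     auxiliary correction weights, fast time scale (step size beta)
    phi, phi_next: feature vectors of current and next state
    rho:   importance-sampling ratio for off-policy correction
    """
    delta = reward + gamma * phi_next @ theta - phi @ theta  # TD error
    # Slow update of theta, with the gradient-correction term
    # -gamma * (w^T phi) * phi_next that distinguishes TDC from plain TD.
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    # Fast update of w toward a least-squares fit of the TD error.
    w = w + beta * rho * (delta - w @ phi) * phi
    return theta, w

# Toy run: consecutive samples share states, mimicking a non-i.i.d.
# Markovian trajectory rather than independent draws.
rng = np.random.default_rng(0)
d = 4
theta, w = np.zeros(d), np.zeros(d)
phi = rng.standard_normal(d)
for _ in range(200):
    phi_next = rng.standard_normal(d)
    reward = rng.standard_normal()
    theta, w = tdc_step(theta, w, phi, phi_next, reward, rho=1.0)
    phi = phi_next  # next transition starts where this one ended
```

The key design point is the separation of step sizes (`beta` larger than `alpha`): the fast iterate `w` approximately converges for the current `theta` between slow updates, which is what the paper's non-asymptotic analysis must control jointly under Markovian noise.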