Regularized Off-Policy TD-Learning
Neural Information Processing Systems
The algorithmic framework underlying RO-TD (regularized off-policy TD-learning) integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented, and a variety of experiments illustrate the off-policy convergence, sparse feature selection capability, and low computational cost of the algorithm.
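As a rough illustration of the two ingredients named above, the sketch below pairs one common form of the TDC update for linear value functions with an l1 shrinkage step. It is not the RO-TD algorithm itself: the abstract describes a saddle-point formulation solved with first-order methods, whereas this sketch swaps in plain soft-thresholding as a simpler stand-in for sparsity-inducing regularization. The function names (`tdc_step`, `soft_threshold`), the parameter names, and the placement of the importance weight `rho` are all assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def soft_threshold(x, lam):
    """Proximal map of lam * ||x||_1: shrink each coordinate toward zero by lam."""
    def shrink(xi):
        if xi > lam:
            return xi - lam
        if xi < -lam:
            return xi + lam
        return 0.0
    return [shrink(xi) for xi in x]


def tdc_step(theta, w, phi, phi_next, reward, rho,
             gamma=0.9, alpha=0.1, beta=0.1, lam=0.0):
    """One TDC-style update followed by l1 shrinkage (illustrative only).

    theta    : value-function weights
    w        : auxiliary weights used for the gradient correction
    phi      : features of the current state
    phi_next : features of the next state
    rho      : importance weight (target / behavior policy probability)
    lam      : l1 strength; the shrinkage step is an assumed stand-in,
               not the saddle-point solver described in the abstract
    """
    # TD error under the current linear value estimate.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)
    # Main weights: TD update plus the gradient-correction term.
    theta_new = [t + alpha * rho * (delta * f - gamma * phi_w * fn)
                 for t, f, fn in zip(theta, phi, phi_next)]
    # Auxiliary weights: track the expected TD error given the features.
    w_new = [wi + beta * rho * (delta - phi_w) * f
             for wi, f in zip(w, phi)]
    # Shrinkage drives small weights exactly to zero (sparse features).
    return soft_threshold(theta_new, alpha * lam), w_new
```

Calling `tdc_step` repeatedly on sampled transitions, with `lam` greater than zero, zeroes out weights whose updates stay below the threshold. That is the intuition behind sparse feature selection, though the paper's own solver differs.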
Mar-14-2024, 05:01:48 GMT