Goto

Collaborating Authors

 withprobabilityone


AUnifiedSwitchingSystemPerspectiveand ConvergenceAnalysisofQ-LearningAlgorithms

Neural Information Processing Systems

However, its application to Q-learning has been limited due to the presence of the max-operator, which makes the associated ODE model a complex nonlinear system. In contrast, the associated ODE of TD learning for policy evaluation is a linear system, whose asymptotic stability is much easier to analyze in general.