
Collaborating Authors

Giegrich, Michael


$K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control

arXiv.org Machine Learning

In reinforcement learning (RL), off-policy evaluation (OPE) deals with the problem of estimating the value of a target policy from observations generated by a different behavior policy. OPE methods are typically applied to sequential decision-making problems where observational data are available but direct experimentation with the environment is impossible or costly. More broadly, OPE is a widely researched subject in RL (see, e.g., [60, 23, 58] for recent overviews); however, relatively little attention has been paid to stochastic environments in which the stochasticity depends on the chosen actions and the state and action spaces are continuous. For example, common benchmark problems are either deterministic or have finite state and/or action spaces (see, e.g., [60, 23]). Yet stochastic control problems are concerned precisely with the setting in which the decision process affects random transitions. Stochastic control is a field closely related to reinforcement learning, and its methods have been applied to a wide range of high-stakes decision-making problems in diverse fields such as operations research [24, 41], economics [31, 29], electrical engineering [44, 17], autonomous driving [62] and finance [15, 55]. In the stochastic control literature, optimal policies are often represented as deterministic feedback policies (i.e., as deterministic functions of the current state) and, in the episodic case, are non-stationary due to the finite time horizon. Stochastic control environments pose a challenging setting for OPE methods. For example, classical methods such as importance sampling (IS) [50] struggle with deterministic target policies in continuous action spaces due to the severe policy mismatch between the target and the behavior policy (see, e.g.
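To make the policy-mismatch issue concrete, below is a minimal sketch of the classical trajectory-wise importance-sampling estimator referenced in the abstract. It illustrates only the IS baseline, not the paper's resampling method, and the function names and signatures are illustrative assumptions.

import numpy as np

def is_ope_estimate(trajectories, target_logpdf, behavior_logpdf, gamma=1.0):
    """Classical trajectory-wise importance-sampling (IS) estimator for OPE.

    trajectories: list of episodes, each a list of (state, action, reward)
        tuples collected under the behavior policy.
    target_logpdf(s, a), behavior_logpdf(s, a): log-densities of action a in
        state s under the target and behavior policies (illustrative API).
    """
    estimates = []
    for episode in trajectories:
        log_weight, ret, discount = 0.0, 0.0, 1.0
        for s, a, r in episode:
            # cumulative likelihood ratio pi_target(a|s) / pi_behavior(a|s)
            log_weight += target_logpdf(s, a) - behavior_logpdf(s, a)
            ret += discount * r
            discount *= gamma
        # With a deterministic target policy on a continuous action space,
        # target_logpdf(s, a) is -inf for (almost) every logged action, so the
        # weight collapses to zero and the estimator degenerates: this is the
        # policy mismatch discussed above.
        estimates.append(np.exp(log_weight) * ret)
    return float(np.mean(estimates))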


Convergence of policy gradient methods for finite-horizon stochastic linear-quadratic control problems

arXiv.org Artificial Intelligence

We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularisers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. In contrast to discrete-time problems, the cost is noncoercive in the policy and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures-Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a priori bound and to converge globally to the optimal policy at a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis and achieves robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.
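To illustrate the policy class described above (a Gaussian policy whose mean is linear in the state and whose covariance is state-independent), here is a minimal discrete-time sketch of a plain score-function policy gradient step on a toy finite-horizon LQ problem. This is an assumption-laden illustration, not the paper's geometry-aware Fisher/Bures-Wasserstein scheme; the dynamics A, B, costs Q, R, horizon, noise scale, step size and sample sizes are all placeholders.

import numpy as np

rng = np.random.default_rng(0)

# Toy finite-horizon discrete-time LQ problem (stand-in for the
# continuous-time exploratory LQC setting of the paper).
T, n, m = 10, 2, 1                     # horizon, state dim, action dim
A, B = np.eye(n), np.ones((n, m))      # dynamics x' = A x + B a + noise
Q, R = np.eye(n), np.eye(m)            # quadratic state/action costs

# Gaussian policy: mean K x is linear in the state, covariance Sigma is
# state-independent; K and Sigma are the learnable parameters.
K = np.zeros((m, n))
Sigma = np.eye(m)

def rollout(K, Sigma, n_paths=256):
    """Monte Carlo estimate of the expected cost and a REINFORCE-style
    gradient with respect to K (plain, non-geometry-aware baseline)."""
    cost, grad_K = 0.0, np.zeros_like(K)
    for _ in range(n_paths):
        x = rng.normal(size=n)
        c, score = 0.0, np.zeros_like(K)
        for _ in range(T):
            a = K @ x + rng.multivariate_normal(np.zeros(m), Sigma)
            c += x @ Q @ x + a @ R @ a
            # score function of the Gaussian policy w.r.t. K:
            # Sigma^{-1} (a - K x) x^T
            score += np.linalg.solve(Sigma, a - K @ x)[:, None] * x[None, :]
            x = A @ x + B @ a + 0.1 * rng.normal(size=n)
        cost += c / n_paths
        grad_K += c * score / n_paths
    return cost, grad_K

eta = 1e-4                             # small illustrative step size
for it in range(50):
    cost, g = rollout(K, Sigma)
    K -= eta * g                       # plain gradient step on the mean parameter
    if it % 10 == 0:
        print(it, round(cost, 3))

A geometry-aware variant of the kind described in the abstract would, roughly speaking, precondition the mean update and move the covariance along the Bures-Wasserstein geometry, e.g. via a retraction of the form $\Sigma \leftarrow (I - \eta G)\,\Sigma\,(I - \eta G)$ for a Euclidean gradient $G$, which keeps $\Sigma$ positive semidefinite; the precise updates, a priori bounds and convergence rates are those of the paper, not of this sketch.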