Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP
Tejaram Sangadi, L. A. Prashanth, Krishna Jagannathan
In the standard reinforcement learning (RL) setting, the objective is to learn a policy that maximizes the value function, i.e., the expectation of the cumulative reward obtained over a finite or infinite time horizon. However, in several practical scenarios, including finance, automated driving, and drug testing, a risk-sensitive learning paradigm becomes important: the value function, which is an expectation, needs to be traded off suitably against an appropriate risk metric associated with the reward distribution. One way to achieve this trade-off is to solve a constrained optimization problem with the value function as the objective and the risk metric as a constraint. Variance is a popular choice of risk measure in this role, and such a mean-variance formulation was studied in the seminal work of Markowitz [10]. In the context of RL, mean-variance optimization has been considered in several previous works, cf.
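As a sketch, the constrained mean-variance problem described above can be written as follows. The notation here is illustrative and not taken from the paper: $G$ denotes the (discounted) cumulative reward under policy $\pi$ from an initial state $s_0$, $\gamma \in (0,1)$ is the discount factor, and $\alpha > 0$ is an assumed variance budget.

```latex
\max_{\pi} \; \mathbb{E}\!\left[ G \mid s_0, \pi \right]
\quad \text{subject to} \quad
\mathrm{Var}\!\left( G \mid s_0, \pi \right) \le \alpha,
\qquad \text{where} \quad
G = \sum_{t=0}^{\infty} \gamma^{t} r_t .
```

This is the RL analogue of the Markowitz portfolio formulation: the expected return is maximized while the dispersion of the return, measured by its variance, is kept below a prescribed threshold.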