Planning and Learning in Average Risk-aware MDPs
–Neural Information Processing Systems
For continuing tasks, average cost Markov decision processes have welldocumented value and can be solved using efficient algorithms. However, it explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms, namely a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Qlearning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, confirm empirically the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
Neural Information Processing Systems
Jun-22-2026, 20:17:06 GMT
- Country:
- North America > United States (0.27)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.92)
- Research Report