Efficient Algorithms for Mitigating Uncertainty and Risk in Reinforcement Learning
This dissertation makes three main contributions. First, we identify a new connection between policy gradient and dynamic programming in multi-model Markov decision processes (MMDPs) and propose the Coordinate Ascent Dynamic Programming (CADP) algorithm, which computes a Markov policy that maximizes the discounted return averaged over the uncertain models. CADP adjusts the model weights iteratively to guarantee monotone policy improvement and convergence to a local maximum. Second, we establish necessary and sufficient conditions for the exponential Bellman operator of the entropic risk measure (ERM) to be a contraction, and we prove the existence of stationary deterministic optimal policies for the ERM total reward criterion (ERM-TRC) and the entropic value-at-risk total reward criterion (EVaR-TRC). We also propose exponential value iteration, policy iteration, and linear programming algorithms for computing optimal stationary policies under both objectives. Third, we propose model-free Q-learning algorithms for the risk-averse ERM-TRC and EVaR-TRC objectives. The challenge is that the ERM Bellman operator underlying Q-learning may not be a contraction. Instead, we use the monotonicity of the ERM Bellman operator to give a rigorous proof that the ERM-TRC and EVaR-TRC Q-learning algorithms converge to the optimal risk-averse value functions and compute optimal stationary policies.
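For context, the risk measures named above have standard definitions in the risk-averse RL literature; the formulas below are background and are not quoted from the dissertation. For a random return X, risk level β > 0, and confidence level α in (0, 1]:

```latex
\mathrm{ERM}_{\beta}[X] = -\tfrac{1}{\beta}\,\log \mathbb{E}\!\left[ e^{-\beta X} \right],
\qquad
\mathrm{EVaR}_{\alpha}[X] = \sup_{\beta > 0} \Big\{ \mathrm{ERM}_{\beta}[X] + \tfrac{\log \alpha}{\beta} \Big\}.
```

ERM recovers the expectation as β → 0 and becomes more risk-averse as β grows, while EVaR is the tightest ERM-based bound at confidence level α. The sketch below shows how an ERM Bellman fixed-point iteration could look on a tabular MDP. It is a minimal illustration, not the dissertation's implementation: the function name `erm_value_iteration`, the array layout, and the stopping rule are assumptions, and convergence under the total reward criterion relies on the contraction conditions that the dissertation establishes.

```python
import numpy as np

def erm_value_iteration(P, R, beta, n_iters=10_000, tol=1e-10):
    """Illustrative ERM Bellman fixed-point iteration on a tabular MDP.

    P: transitions, shape (A, S, S), with P[a, s, t] = Pr(t | s, a)
    R: rewards, shape (A, S, S)
    beta: ERM risk-aversion level (> 0)

    Applies (Bv)(s) = max_a -(1/beta) * log(sum_t P[a,s,t] * exp(-beta*(R[a,s,t] + v[t]))).
    For the total reward criterion, an absorbing zero-reward goal state should be
    encoded in P and R; a log-sum-exp form would avoid overflow for large beta.
    """
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(n_iters):
        # Q[a, s]: ERM of the one-step reward plus the continuation value.
        Q = -np.log(np.einsum('ast,ast->as',
                              P, np.exp(-beta * (R + v[None, None, :])))) / beta
        v_new = Q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    # Return the ERM value function and a greedy deterministic policy.
    return v, Q.argmax(axis=0)
```

Given ERM values computed for a grid of β levels, the EVaR objective at confidence α could then be approximated by maximizing ERM_β + log(α)/β over that grid, reflecting the reduction of EVaR to a family of ERM problems that this line of work exploits.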
arXiv.org Artificial Intelligence
Oct-21-2025