Risk-Sensitive RL


Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret

Neural Information Processing Systems

We study risk-sensitive reinforcement learning in episodic Markov decision processes with unknown transition kernels, where the goal is to optimize the total reward under the risk measure of exponential utility. We propose two provably efficient model-free algorithms, Risk-Sensitive Value Iteration (RSVI) and Risk-Sensitive Q-learning (RSQ). These algorithms implement a form of risk-sensitive optimism in the face of uncertainty, which adapts to both risk-seeking and risk-averse modes of exploration.
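The exponential-utility criterion this abstract refers to replaces the expected return E[R] with (1/β) log E[exp(βR)], where the sign of β selects risk-seeking (β > 0) or risk-averse (β < 0) behavior and β → 0 recovers the mean. A minimal numerical sketch of the criterion on sampled returns (the function name and API are ours, not the paper's):

```python
import numpy as np

def entropic_risk(returns, beta):
    """Exponential-utility (entropic) risk of sampled returns:
    (1/beta) * log E[exp(beta * R)], computed stably via log-sum-exp.

    beta > 0: risk-seeking (pulled toward the max);
    beta < 0: risk-averse (pulled toward the min);
    beta -> 0 recovers the plain mean.
    """
    r = np.asarray(returns, dtype=float)
    m = beta * r
    return (m.max() + np.log(np.mean(np.exp(m - m.max())))) / beta

returns = [1.0, 2.0, 10.0]
print(entropic_risk(returns, beta=1e-8))  # ~ mean of the returns
print(entropic_risk(returns, beta=5.0))   # pulled toward the maximum
print(entropic_risk(returns, beta=-5.0))  # pulled toward the minimum
```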


Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models

Jiang, Yuhua, Huang, Jiawei, Yuan, Yufeng, Mao, Xin, Yue, Yu, Zhao, Qianchuan, Yan, Lin

arXiv.org Artificial Intelligence

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. However, existing methods suffer from an exploration dilemma: the sharply peaked initial policies of pre-trained LLMs confine standard RL algorithms to a narrow set of solutions, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. To overcome this, we introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm, Risk-Sensitive GRPO (RS-GRPO), which drives deeper exploration by amplifying learning from challenging prompts. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications. On six mathematical reasoning benchmarks and with five different LLMs, RS-GRPO consistently improves pass@k performance while maintaining or enhancing pass@1 accuracy.
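The abstract describes a risk-seeking objective that interpolates between the mean and the maximum of a group's rewards. One standard way to realize such an interpolation is exponential (softmax) weighting of the sampled solutions in a group; the sketch below is illustrative only and is not taken from the paper's algorithm or code:

```python
import numpy as np

def risk_seeking_weights(rewards, beta):
    """Normalized exp(beta * r) weights over a group of sampled solutions.

    At beta = 0 every sample gets weight 1/n (the usual group mean);
    as beta grows, weight concentrates on the best solutions, biasing
    the update toward rare, high-reward completions.
    """
    r = np.asarray(rewards, dtype=float)
    w = np.exp(beta * (r - r.max()))  # shift by max for numerical stability
    return w / w.sum()

group = [0.0, 0.0, 1.0, 0.0]  # e.g. one verified-correct answer out of four
print(risk_seeking_weights(group, beta=0.0))  # uniform: [0.25 0.25 0.25 0.25]
print(risk_seeking_weights(group, beta=4.0))  # concentrated on the correct one
```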


Robust Risk-Sensitive Reinforcement Learning with Conditional Value-at-Risk

Ni, Xinyi, Lai, Lifeng

arXiv.org Machine Learning

Robust Markov Decision Processes (RMDPs) have received significant research interest, offering an alternative to standard Markov Decision Processes (MDPs), which often assume fixed transition probabilities. RMDPs address this by optimizing for the worst-case scenarios within ambiguity sets. While earlier studies on RMDPs have largely centered on risk-neutral reinforcement learning (RL), with the goal of minimizing expected total discounted costs, in this paper we analyze the robustness of CVaR-based risk-sensitive RL under RMDPs. First, we consider predetermined ambiguity sets. Based on the coherence of CVaR, we establish a connection between robustness and risk sensitivity, so techniques from risk-sensitive RL can be adopted to solve the proposed problem. Furthermore, motivated by the existence of decision-dependent uncertainty in real-world problems, we study problems with state-action-dependent ambiguity sets. To solve these, we define a new risk measure named NCVaR and establish the equivalence of NCVaR optimization and robust CVaR optimization. We further propose value iteration algorithms and validate our approach in simulation experiments.
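CVaR at level τ, the risk measure this entry builds on, is the expected cost over the worst τ-fraction of outcomes. A minimal empirical estimator on samples (illustrative only, not the paper's NCVaR construction):

```python
import numpy as np

def cvar(costs, tau):
    """Empirical CVaR_tau of costs: average of the worst tau-fraction of samples."""
    c = np.sort(np.asarray(costs, dtype=float))  # ascending: worst costs last
    k = max(1, int(np.ceil(tau * len(c))))       # size of the tau-tail
    return c[-k:].mean()                          # mean of the k largest costs

costs = [1.0, 2.0, 3.0, 4.0, 100.0]
print(cvar(costs, tau=0.2))  # worst 20% of 5 samples -> 100.0
print(cvar(costs, tau=1.0))  # tau = 1 recovers the plain mean -> 22.0
```

Note that tau = 1 degrades gracefully to the risk-neutral expected cost, which is why CVaR is often used as a tunable bridge between risk-neutral and worst-case objectives.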


Provably Efficient CVaR RL in Low-rank MDPs

Zhao, Yulai, Zhan, Wenhao, Hu, Xiaoyan, Leung, Ho-fung, Farnia, Farzan, Sun, Wen, Lee, Jason D.

arXiv.org Machine Learning

We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$. Prior theoretical work on risk-sensitive RL focuses on the tabular Markov Decision Process (MDP) setting. To extend CVaR RL to settings where the state space is large, function approximation must be deployed. We study CVaR RL in low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume the underlying transition kernel admits a low-rank decomposition, but unlike prior linear models, low-rank MDPs do not assume the feature or state-action representation is known. We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to carefully balance the interplay between exploration, exploitation, and representation learning in CVaR RL. We prove that our algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$ to yield an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of the representations. Computationally, we design a novel discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that we can find a near-optimal policy in polynomial running time with a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in low-rank MDPs.


Non-ergodicity in reinforcement learning: robustness via ergodicity transformations

Baumann, Dominik, Noorani, Erfaun, Price, James, Peters, Ole, Connaughton, Colm, Schön, Thomas B.

arXiv.org Artificial Intelligence

Envisioned application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance, which all require RL agents to make decisions in the real world. A significant challenge hindering the adoption of RL methods in these domains is the non-robustness of conventional algorithms. In this paper, we argue that a fundamental issue contributing to this lack of robustness lies in the focus on the expected value of the return as the sole "correct" optimization objective. The expected value is the average over the statistical ensemble of infinitely many trajectories. For non-ergodic returns, this average differs from the average over a single but infinitely long trajectory. Consequently, optimizing the expected value can lead to policies that yield exceptionally high returns with probability zero but almost surely result in catastrophic outcomes. This problem can be circumvented by transforming the time series of collected returns into one with ergodic increments. This transformation enables learning robust policies by optimizing the long-term return for individual agents rather than the average across infinitely many trajectories. We propose an algorithm for learning ergodicity transformations from data and demonstrate its effectiveness in an instructive, non-ergodic environment and on standard RL benchmarks.
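The gap between ensemble and time averages that this abstract describes is classically illustrated by a multiplicative coin toss, where expected wealth grows while almost every individual trajectory decays; for such multiplicative dynamics, the logarithm is the ergodicity transformation that makes increments additive. A sketch of that standard example (the 1.5/0.6 parameters are the usual textbook values, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Multiplicative coin toss: wealth *= 1.5 on heads, *= 0.6 on tails.
# Ensemble average grows (E[factor] = 1.05 > 1), yet almost every single
# trajectory decays, since E[log factor] = 0.5*(log 1.5 + log 0.6) < 0.
factors = rng.choice([1.5, 0.6], size=(10_000, 1_000))  # (trajectories, steps)

ensemble_growth = np.mean(factors)      # > 1: the expectation grows per step
time_growth = np.mean(np.log(factors))  # < 0: individual wealth decays per step

print(ensemble_growth)  # ~ 1.05
print(time_growth)      # ~ -0.05
```

Optimizing the expected value here chases the first number; optimizing the log-transformed (ergodic) increments tracks the second, which is what any single agent actually experiences over time.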


Regret Bounds for Markov Decision Processes with Recursive Optimized Certainty Equivalents

Xu, Wenhao, Gao, Xuefeng, He, Xuedong

arXiv.org Artificial Intelligence

Reinforcement learning (RL) studies the problem of sequential decision making in an unknown environment by carefully balancing between exploration and exploitation (Sutton and Barto 2018). In the classical setting, it describes how an agent takes actions to maximize expected cumulative rewards in an environment typically modeled by a Markov decision process (MDP, Puterman (2014)). However, optimizing the expected cumulative rewards alone is often not sufficient in many practical applications such as finance, healthcare and robotics. Hence, it may be necessary to take into account the risk preferences of the agent in the dynamic decision process. Indeed, a rich body of literature has studied risk-sensitive (and safe) RL, incorporating risk measures such as the entropic risk measure and conditional value-at-risk (CVaR) into the decision criterion; see, e.g., Shen et al. (2014), García and Fernández (2015), Tamar et al. (2016), Chow et al. (2017), Prashanth L and Fu (2018), Fei et al. (2020) and the references therein. In this paper we study risk-sensitive RL for tabular MDPs with unknown transition probabilities in the finite-horizon, episodic setting, where an agent interacts with the MDP in episodes of a fixed length with finite state and action spaces. To incorporate risk sensitivity, we consider a broad and important class of risk measures known as the Optimized Certainty Equivalent (OCE, (Ben-Tal and Teboulle 1986, 2007)). The OCE is a (nonlinear) risk function which assigns a random variable X a real value, and it depends on a concave utility function; see Equation (1) for the definition.
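The OCE of a random variable X with concave utility u is OCE_u(X) = sup_η {η + E[u(X − η)]}; choosing u(t) = t recovers the mean, and u(t) = min(t, 0)/τ recovers CVaR at level τ (of the lower tail of rewards). A grid-search sketch on samples (illustrative only, not the paper's algorithm):

```python
import numpy as np

def oce(samples, u):
    """Optimized Certainty Equivalent: sup_eta { eta + E[u(X - eta)] },
    approximated by a dense grid search over eta in the sample range."""
    x = np.asarray(samples, dtype=float)
    etas = np.linspace(x.min(), x.max(), 2001)
    return max(eta + np.mean(u(x - eta)) for eta in etas)

x = [0.0, 1.0, 2.0, 3.0]
print(oce(x, lambda t: t))  # u(t) = t gives the plain mean: 1.5

tau = 0.5
print(oce(x, lambda t: np.minimum(t, 0.0) / tau))  # CVaR_0.5: mean of worst half = 0.5
```

The point of the OCE family is exactly this: a single concave utility u interpolates between the risk-neutral mean and strongly risk-averse criteria such as CVaR and the entropic risk measure.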


Cascaded Gaps: Towards Gap-Dependent Regret for Risk-Sensitive Reinforcement Learning

Fei, Yingjie, Xu, Ruitu

arXiv.org Machine Learning

In this paper, we study gap-dependent regret guarantees for risk-sensitive reinforcement learning based on the entropic risk measure. We propose a novel definition of sub-optimality gaps, which we call cascaded gaps, and we discuss their key components that adapt to the underlying structures of the problem. Based on the cascaded gaps, we derive non-asymptotic and logarithmic regret bounds for two model-free algorithms under episodic Markov decision processes. We show that, in appropriate settings, these bounds feature exponential improvement over existing ones that are independent of gaps. We also prove gap-dependent lower bounds, which certify the near optimality of the upper bounds.