non-stationary policy
Teaching Precommitted Agents: Model-Free Policy Evaluation and Control in Quasi-Hyperbolic Discounted MDPs
Abstract-- Time-inconsistent preferences, where agents favor smaller-sooner over larger-later rewards, are a key feature of human and animal decision-making. Quasi-Hyperbolic (QH) discounting provides a simple yet powerful model for this behavior, but its integration into the reinforcement learning (RL) framework has been limited. We make two primary contributions: (i) we formally characterize the structure of the optimal policy, proving for the first time that it reduces to a simple one-step non-stationary form; and (ii) we design the first practical, model-free algorithms for both policy evaluation and Q-learning in this setting, both with provable convergence guarantees. Our results provide foundational insights for incorporating QH preferences in RL. Reinforcement learning (RL) [12] provides a powerful framework for sequential decision-making, where an agent interacts with an environment to maximize long-term cumulative rewards.
Dynamically Optimal Treatment Allocation
Adusumilli, Karun, Geiecke, Friedrich, Schilter, Claudio
Many critical treatment assignment problems are inherently dynamic in nature. For example, consider the allocation of job-training programs to newly unemployed individuals. Such programs are often funded through a fixed yearly budget determined in advance by the legislature, while job-seekers arrive at job centers sequentially throughout the year. Policymakers must decide whether to allocate training to each individual, recognizing that each decision affects the remaining budget and the availability of training for future applicants who may benefit even more. In essence, policymakers face a dynamic optimization problem, balancing immediate needs against long-term budgetary constraints.
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
We consider infinite-horizon stationary \gamma -discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error \epsilon at each iteration, it is well-known that one can compute stationary policies that are \frac{2\gamma{(1-\gamma) 2}\epsilon -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to \frac{2\gamma}{1-\gamma}\epsilon -optimal, which constitutes a significant improvement in the usual situation when \gamma is close to 1 . Surprisingly, this shows that the problem of computing near-optimal non-stationary policies'' is much simpler than that of computing near-optimal stationary policies''.
Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning
Li, Yuanzhi, Wang, Ruosong, Yang, Lin F.
Reinforcement learning (RL) is one of the most important paradigms in machine learning. What makes RL different from other paradigms is that it models the long-term effects in decision-making problems. For instance, in a finite-horizon Markov decision process (MDP), which is one of the most fundamental models for RL, an agent interacts with the environment for a total of H steps and receives a sequence of H random reward values, along with stochastic state transitions, as feedback. The goal of the agent is to find a policy to maximize the expected sum of these rewards values instead of any single one of them. Since decisions made at early stages could significantly impact the future, the agent must take possible future transitions into consideration when choosing the policy. On the other hand, when H 1, RL reduces to the contextual bandits problem in which it suffices to act myopically to achieve optimality. Due to the important role of the horizon length in RL, Jiang and Agarwal [JA18] propose to study how the sample complexity of RL depends on the horizon length. More formally, let us consider the episodic RL setting, where the horizon length is H and the underlying MDP has unknown and time invariant transition probabilities and rewards.
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
Scherrer, Bruno, Lesner, Boris
We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $\epsilon$ at each iteration, it is well-known that one can compute stationary policies that are $\frac{2\gamma{(1-\gamma) 2}\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant improvement in the usual situation when $\gamma$ is close to $1$. Surprisingly, this shows that the problem of computing near-optimal non-stationary policies'' is much simpler than that of computing near-optimal stationary policies''. Papers published at the Neural Information Processing Systems Conference.
Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach
Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuous settings, with fixed discount factor $\gamma < 1$, or in episodic settings, with $\gamma = 1$. While this has proven effective for specific tasks with well-defined objectives (e.g., games), it has never been established that fixed discounting is suitable for general purpose use (e.g., as a model of human preferences). This paper characterizes rationality in sequential decision making using a set of seven axioms and arrives at a form of discounting that generalizes traditional fixed discounting. In particular, our framework admits a state-action dependent "discount" factor that is not constrained to be less than 1, so long as there is eventual long run discounting. Although this broadens the range of possible preference structures in continuous settings, we show that there exists a unique "optimizing MDP" with fixed $\gamma < 1$ whose optimal value function matches the true utility of the optimal policy, and we quantify the difference between value and utility for suboptimal policies. Our work can be seen as providing a normative justification for (a slight generalization of) Martha White's RL task formalism (2017) and other recent departures from the traditional RL, and is relevant to task specification in RL, inverse RL and preference-based RL.
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
Scherrer, Bruno, Lesner, Boris
We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $\epsilon$ at each iteration, it is well-known that one can compute stationary policies that are $\frac{2\gamma{(1-\gamma)^2}\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant improvement in the usual situation when $\gamma$ is close to $1$. Surprisingly, this shows that the problem of ``computing near-optimal non-stationary policies'' is much simpler than that of ``computing near-optimal stationary policies''.
On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes
We consider infinite-horizon $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies $\pi_1,...,\pi_k$ it implicitely generates until some iteration $k$. We provide performance bounds for non-stationary policies involving the last $m$ generated policies that reduce the state-of-the-art bound for the last stationary policy $\pi_k$ by a factor $\frac{1-\gamma}{1-\gamma^m}$. In particular, the use of non-stationary policies allows to reduce the usual asymptotic performance bounds of Value Iteration with errors bounded by $\epsilon$ at each iteration from $\frac{\gamma}{(1-\gamma)^2}\epsilon$ to $\frac{\gamma}{1-\gamma}\epsilon$, which is significant in the usual situation when $\gamma$ is close to 1. Given Bellman operators that can only be computed with some error $\epsilon$, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only has a guarantee of $\frac{1}{1-\gamma}\epsilon$.