AITopics

Country: North America (0.68)

Industry: Transportation (0.30)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial IntelligenceSep-9-2025

Teaching Precommitted Agents: Model-Free Policy Evaluation and Control in Quasi-Hyperbolic Discounted MDPs

Eshwar, S. R.

Abstract-- Time-inconsistent preferences, where agents favor smaller-sooner over larger-later rewards, are a key feature of human and animal decision-making. Quasi-Hyperbolic (QH) discounting provides a simple yet powerful model for this behavior, but its integration into the reinforcement learning (RL) framework has been limited. We make two primary contributions: (i) we formally characterize the structure of the optimal policy, proving for the first time that it reduces to a simple one-step non-stationary form; and (ii) we design the first practical, model-free algorithms for both policy evaluation and Q-learning in this setting, both with provable convergence guarantees. Our results provide foundational insights for incorporating QH preferences in RL. Reinforcement learning (RL) [12] provides a powerful framework for sequential decision-making, where an agent interacts with an environment to maximize long-term cumulative rewards.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2509.06094

Country: Asia > India (0.14)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Adusumilli, Karun, Geiecke, Friedrich, Schilter, Claudio

Dynamically Optimal Treatment Allocation

arXiv.org Artificial IntelligenceNov-25-2024

Many critical treatment assignment problems are inherently dynamic in nature. For example, consider the allocation of job-training programs to newly unemployed individuals. Such programs are often funded through a fixed yearly budget determined in advance by the legislature, while job-seekers arrive at job centers sequentially throughout the year. Policymakers must decide whether to allocate training to each individual, recognizing that each decision affects the remaining budget and the availability of training for future applicants who may benefit even more. In essence, policymakers face a dynamic optimization problem, balancing immediate needs against long-term budgetary constraints.

boundary condition, budget, viscosity solution, (16 more...)

1904.01047

Country:

Europe > Sweden (0.04)
North America > United States > Pennsylvania (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Education (1.00)
Banking & Finance > Economy (1.00)
Government > Regional Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Neural Information Processing SystemsMar-14-2024, 11:54:56 GMT

On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy.

algorithm, non-stationary policy, policy iteration, (10 more...)

Country:

Europe > France (0.04)
North America > United States > New York (0.04)
North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.71)

Neural Information Processing SystemsFeb-16-2024, 07:06:42 GMT

On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

We consider infinite-horizon stationary \gamma -discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error \epsilon at each iteration, it is well-known that one can compute stationary policies that are \frac{2\gamma{(1-\gamma) 2}\epsilon -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to \frac{2\gamma}{1-\gamma}\epsilon -optimal, which constitutes a significant improvement in the usual situation when \gamma is close to 1 . Surprisingly, this shows that the problem of computing near-optimal non-stationary policies'' is much simpler than that of computing near-optimal stationary policies''.

gamma, non-stationary policy, stationary infinite-horizon markov decision process, (3 more...)

Technology:

Information Technology > Decision Support Systems (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)

Li, Yuanzhi, Wang, Ruosong, Yang, Lin F.

Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning

arXiv.org Artificial IntelligenceOct-31-2021

Reinforcement learning (RL) is one of the most important paradigms in machine learning. What makes RL different from other paradigms is that it models the long-term effects in decision-making problems. For instance, in a finite-horizon Markov decision process (MDP), which is one of the most fundamental models for RL, an agent interacts with the environment for a total of H steps and receives a sequence of H random reward values, along with stochastic state transitions, as feedback. The goal of the agent is to find a policy to maximize the expected sum of these rewards values instead of any single one of them. Since decisions made at early stages could significantly impact the future, the agent must take possible future transitions into consideration when choosing the policy. On the other hand, when H 1, RL reduces to the contextual bandits problem in which it suffices to act myopically to achieve optimality. Due to the important role of the horizon length in RL, Jiang and Agarwal [JA18] propose to study how the sample complexity of RL depends on the horizon length. More formally, let us consider the episodic RL setting, where the horizon length is H and the underlying MDP has unknown and time invariant transition probabilities and rewards.

algorithm, probability, sample complexity, (12 more...)

2111.00633

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.49)

Scherrer, Bruno, Lesner, Boris

On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Neural Information Processing SystemsFeb-14-2020, 23:13:23 GMT

We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $\epsilon$ at each iteration, it is well-known that one can compute stationary policies that are $\frac{2\gamma{(1-\gamma) 2}\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant improvement in the usual situation when $\gamma$ is close to $1$. Surprisingly, this shows that the problem of computing near-optimal non-stationary policies'' is much simpler than that of computing near-optimal stationary policies''. Papers published at the Neural Information Processing Systems Conference.

gamma, non-stationary policy, stationary infinite-horizon markov decision process, (3 more...)

Technology:

Information Technology > Decision Support Systems (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)

arXiv.org Artificial IntelligenceFeb-7-2019

Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach

Pitis, Silviu

Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuous settings, with fixed discount factor $\gamma < 1$, or in episodic settings, with $\gamma = 1$. While this has proven effective for specific tasks with well-defined objectives (e.g., games), it has never been established that fixed discounting is suitable for general purpose use (e.g., as a model of human preferences). This paper characterizes rationality in sequential decision making using a set of seven axioms and arrives at a form of discounting that generalizes traditional fixed discounting. In particular, our framework admits a state-action dependent "discount" factor that is not constrained to be less than 1, so long as there is eventual long run discounting. Although this broadens the range of possible preference structures in continuous settings, we show that there exists a unique "optimizing MDP" with fixed $\gamma < 1$ whose optimal value function matches the true utility of the optimal policy, and we quantify the difference between value and utility for suboptimal policies. Our work can be seen as providing a normative justification for (a slight generalization of) Martha White's RL task formalism (2017) and other recent departures from the traditional RL, and is relevant to task specification in RL, inverse RL and preference-based RL.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

1902.02893

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States (0.05)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Scherrer, Bruno, Lesner, Boris

On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Neural Information Processing SystemsDec-31-2012

We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $\epsilon$ at each iteration, it is well-known that one can compute stationary policies that are $\frac{2\gamma{(1-\gamma)^2}\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant improvement in the usual situation when $\gamma$ is close to $1$. Surprisingly, this shows that the problem of ``computing near-optimal non-stationary policies'' is much simpler than that of ``computing near-optimal stationary policies''.

artificial intelligence, machine learning, non-stationary policy, (10 more...)

Country: North America > United States (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.71)

arXiv.org Artificial IntelligenceMar-30-2012

On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes

Scherrer, Bruno

We consider infinite-horizon $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies $\pi_1,...,\pi_k$ it implicitely generates until some iteration $k$. We provide performance bounds for non-stationary policies involving the last $m$ generated policies that reduce the state-of-the-art bound for the last stationary policy $\pi_k$ by a factor $\frac{1-\gamma}{1-\gamma^m}$. In particular, the use of non-stationary policies allows to reduce the usual asymptotic performance bounds of Value Iteration with errors bounded by $\epsilon$ at each iteration from $\frac{\gamma}{(1-\gamma)^2}\epsilon$ to $\frac{\gamma}{1-\gamma}\epsilon$, which is significant in the usual situation when $\gamma$ is close to 1. Given Bellman operators that can only be computed with some error $\epsilon$, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only has a guarantee of $\frac{1}{1-\gamma}\epsilon$.

artificial intelligence, machine learning, non-stationary policy, (16 more...)

1203.5532

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.62)