AITopics | average reward mdp

Collaborating Authors

average reward mdp

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Neural Information Processing SystemsDec-24-2025, 19:06:33 GMT

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs).However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies.In this paper, we present the first *tractable* algorithm with minimax optimal regret of $\mathrm{O}\left(\sqrt{\mathrm{sp}(h^*) S A T \log(SAT)}\right)$ where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S\times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, **P**rojected **M**itigated **E**xtended **V**alue **I**teration (`PMEVI`), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to obtain improved regret bounds.

artificial intelligence, machine learning, tractable minimax optimal regret, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.79)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)

Add feedback

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Neural Information Processing SystemsMay-26-2025, 20:22:25 GMT

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs).However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies.In this paper, we present the first *tractable* algorithm with minimax optimal regret of \mathrm{O}\left(\sqrt{\mathrm{sp}(h *) S A T \log(SAT)}\right) where \mathrm{sp}(h *) is the span of the optimal bias function h *, S\times A is the size of the state-action space and T the number of learning steps. Remarkably, our algorithm does not require prior information on \mathrm{sp}(h *) . Our algorithm relies on a novel subroutine, **P**rojected **M**itigated **E**xtended **V**alue **I**teration ( PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to obtain improved regret bounds.

artificial intelligence, machine learning, tractable minimax optimal regret, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.86)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.66)

Add feedback

On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes

Kumar, Navdeep, Murthy, Yashaswini, Shufaro, Itai, Levy, Kfir Y., Srikant, R., Mannor, Shie

arXiv.org Artificial IntelligenceMar-11-2024

We present the first finite time global convergence analysis of policy gradient in the context of infinite horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left({\frac{1}{T}}\right),$ which translates to $O\left({\log(T)}\right)$ regret, where $T$ represents the number of iterations. Prior work on performance bounds for discounted reward MDPs cannot be extended to average reward MDPs because the bounds grow proportional to the fifth power of the effective horizon. Thus, our primary contribution is in proving that the policy gradient algorithm converges for average-reward MDPs and in obtaining finite-time performance guarantees. In contrast to the existing discounted reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted reward MDPs. We also present simulations to empirically evaluate the performance of average reward policy gradient algorithm.

average reward, convergence, policy gradient, (15 more...)

arXiv.org Artificial Intelligence

2403.06806

Country:

Asia > Middle East > Israel (0.04)
North America > United States > Illinois > Champaign County > Urbana (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)

Add feedback

Critic-Actor for Average Reward MDPs with Function Approximation: A Finite-Time Analysis

Panda, Prashansa, Bhatnagar, Shalabh

arXiv.org Artificial IntelligenceFeb-2-2024

In recent years, there has been a lot of research work activity focused on carrying out asymptotic and non-asymptotic convergence analyses for two-timescale actor critic algorithms where the actor updates are performed on a timescale that is slower than that of the critic. In a recent work, the critic-actor algorithm has been presented for the infinite horizon discounted cost setting in the look-up table case where the timescales of the actor and the critic are reversed and asymptotic convergence analysis has been presented. In our work, we present the first critic-actor algorithm with function approximation and in the long-run average reward setting and present the first finite-time (non-asymptotic) analysis of such a scheme. We obtain optimal learning rates and prove that our algorithm achieves a sample complexity of $\mathcal{\tilde{O}}(\epsilon^{-2.08})$ for the mean squared error of the critic to be upper bounded by $\epsilon$ which is better than the one obtained for actor-critic in a similar setting. We also show the results of numerical experiments on three benchmark settings and observe that the critic-actor algorithm competes well with the actor-critic algorithm.

algorithm, average reward mdp, function approximation, (11 more...)

arXiv.org Artificial Intelligence

2402.01371

Country: Asia > India > Karnataka > Bengaluru (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.62)

Add feedback

Span-Based Optimal Sample Complexity for Average Reward MDPs

Zurek, Matthew, Chen, Yudong

arXiv.org Machine LearningNov-22-2023

We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2} \right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\varepsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. Our result is based on reducing the average-reward MDP to a discounted MDP. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2} \right)$ samples suffice to learn a $\varepsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma \geq 1 - \frac{1}{H}$, circumventing the well-known lower bound of $\widetilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\varepsilon^2} \right)$ for general $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span parameter. These bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.

artificial intelligence, average reward mdp, span-based optimal sample complexity

arXiv.org Machine Learning

2311.13469

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence (0.87)

Add feedback