AITopics | mdp

Collaborating Authors

mdp

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Learning Kernel-Based MDPs from Episodic Preferential Feedback

Pavlovic, Nikola, Vakili, Sattar, Zhao, Qing

arXiv.org Machine LearningMay-26-2026

Human feedback often arrives as preferences rather than calibrated numeric rewards, motivating reinforcement learning from preferential feedback, also referred to as reinforcement learning from human feedback (RLHF). We present a rigorous theoretical study of preference-only learning in episodic kernel MDPs. In each episode, the learner deploys two policies from a common start state and receives a single binary label indicating which trajectory is preferred, modeled by a Bradley--Terry--Luce link on the difference of cumulative (unobserved) rewards. Under kernel-based assumptions on the reward and transition functions (one of the most general models amenable to theoretical analysis) we develop preference-based value estimation and confidence sets tailored to end-of-episode comparisons. We prove high-probability regret bounds that scale sublinearly in the number of episodes, implying that the value of the learned policy converges to that of the optimal policy.

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Machine Learning

2605.2365

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Add feedback

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Mortensen, Oliver, Talebi, Mohammad Sadegh

arXiv.org Machine LearningMay-22-2026

We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family or risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of state-action space, and for a more restricted class of utilities, we derive lower bounds that makes the dependence on the effective horizon $\frac{1}{1-γ}$ explicit. Specifically, for $\text{CVaR}_τ$ we show that the correct dependence on $τ$ is $\frac{1}{τ^2}$, thus improving by a factor of $\frac{1}τ$ over state-of-the-art although our bound has a suboptimal dependence on $\frac{1}{1-γ}$.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Machine Learning

2605.21763

Country: Europe (0.46)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

Boudart, Pierre, Gaillard, Pierre, Rudi, Alessandro

arXiv.org Machine LearningMay-20-2026

We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\smash{\tilde{O}(dH^2\sqrt{T})}$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\barσ\_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner's trajectory. We propose an algorithm achieving a regret of $\smash{\tilde{O}(dH^2\barσ\_T\sqrt{T})}$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\barσ\_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\smash{Ω(dH^2\barσ\_T\sqrt{T})}$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2605.19768

Country: Europe > France (0.46)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement

Liang, Haodong, Lai, Lifeng

arXiv.org Machine LearningMay-8-2026

We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement policy-improvement methods, including semi-gradient SARSA and actor-critic, via explicit parameter constructions. Beyond existence, we design a teacher-mimicking training procedure, analyze its gradient-flow dynamics, and establish the first convergence guarantee in the ICRL literature: under suitable richness conditions on the training MDP distribution, gradient flow converges locally and exponentially to an optimal parameter manifold corresponding to the desired RL update. Empirically, training transformers on randomly generated tabular MDPs confirms these predictions: the learned models recover the parameter structure of our explicit constructions and, when deployed on unseen MDPs, deliver strong in-context control performance. Together, these results illuminate how transformer architectures internalize and execute classical reinforcement learning algorithms in context, bridging mechanistic understanding and training dynamics in ICRL.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2605.05755

Country: North America > United States > California (0.28)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

Kone, Cyrille, Jamieson, Kevin

arXiv.org Machine LearningMay-6-2026

We study the $(\varepsilon, δ)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high computational cost, rendering them hard to implement, and also suffer from suboptimal dependence on $\log(1/δ)$. We propose a randomized and computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method achieves asymptotic optimality in sample complexity, also in terms of posterior contraction rate, and runs in $O(S^2AH)$ per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid sub-optimal polynomial dependence on $\log(1/δ)$. Our results provide both theoretical insights and practical tools for efficient policy identification in tabular MDPs.

artificial intelligence, machine learning, sth, (15 more...)

arXiv.org Machine Learning

2605.03921

Genre: Research Report > New Finding (0.48)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)

Add feedback

Safe Exploration in Finite Markov Decision Processes with Gaussian Processes

Matteo Turchetta, Felix Berkenkamp, Andreas Krause

Neural Information Processing SystemsMay-1-2026, 05:56:07 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, artificial intelligence, machine learning, (15 more...)

Neural Information Processing Systems

Country:

Europe (0.46)
North America > United States (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.64)

Add feedback

Safe Policy Improvement by Minimizing Robust Baseline Regret

Mohammad Ghavamzadeh, Marek Petrik, Yinlam Chow

Neural Information Processing SystemsMay-1-2026, 05:55:55 GMT

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, which is guaranteed to outperform a given baseline strategy. In this paper, we develop and analyze a new model-based approach that computes a safe policy, given an inaccurate model of the system's dynamics and guarantees on the accuracy of this model. The new robust method uses this model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to seamlessly fall back to the baseline policy, otherwise. We show that our formulation is NP-hard and propose a simple approximate algorithm. Our empirical results on several domains further show that even the simple approximate algorithm can outperform standard approaches.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

Neural Information Processing Systems

Industry: Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)

Add feedback

Online Robust Reinforcement Learning with Model Uncertainty

Neural Information Processing SystemsMay-1-2026, 02:16:05 GMT

Robust reinforcement learning (RL) is to find a policy that optimizes the worstcase performance over an uncertainty set of MDPs. In this paper, we focus on model-free robust RL, where the uncertainty set is defined to be centering at a misspecified MDP that generates a single sample trajectory sequentially, and is assumed to be unknown. We develop a sample-based approach to estimate the unknown uncertainty set, and design robust Q-learning algorithm (tabular case) and robust TDC algorithm (function approximation setting), which can be implemented in an online and incremental fashion. For the robust Q-learning algorithm, we prove that it converges to the optimal robust Q function, and for the robust TDC algorithm, we prove that it converges asymptotically to some stationary points. Unlike the results in [Roy et al., 2017], our algorithms do not need any additional conditions on the discount factor to guarantee the convergence. We further characterize the finite-time error bounds of the two algorithms, and show that both the robust Qlearning and robust TDC algorithms converge as fast as their vanilla counterparts (within a constant factor). Our numerical experiments further demonstrate the robustness of our algorithms. Our approach can be readily extended to robustify many other algorithms, e.g., TD, SARSA, and other GTD algorithms.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Genre: Research Report (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning

Neural Information Processing SystemsMay-1-2026, 01:33:10 GMT

While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it is better than vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with optimal achievable cost. Particularly, our bounds converge much faster than those from non-distributional approaches if the optimal cost is small. As warmup, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state-of-the-art on three real-world tasks. In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation. We prove that our algorithm enjoys novel small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce the ℓ1 distributional eluder dimension which may be of independent interest. Then, in offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.

Add feedback

Filters

Collaborating Authors

mdp

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Learning Kernel-Based MDPs from Episodic Preferential Feedback

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement

Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

Safe Exploration in Finite Markov Decision Processes with Gaussian Processes

Safe Policy Improvement by Minimizing Robust Baseline Regret

Online Robust Reinforcement Learning with Model Uncertainty

11f9e78e4899a78dedd439fc583b6693-Paper.pdf

The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning