AITopics | episodic mdp

Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

Neural Information Processing SystemsJun-14-2026, 14:10:46 GMT

We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve O(logT) regret in stochastic settings and O( T) regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknowntransition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.

algorithm, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.65)

Industry: Education > Educational Setting > Online (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.71)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.45)

Add feedback

000c076c390a4c357313fca29e390ece-Paper.pdf

Neural Information Processing SystemsApr-24-2026, 09:32:12 GMT

machine learning, reinforcement learning, state-action pair, (13 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.50)

Add feedback

60ce36723c17bbac504f2ef4c8a46995-Supplemental.pdf

Neural Information Processing SystemsFeb-19-2026, 03:16:27 GMT

arxiv preprint arxiv, inequality, mdp, (12 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Data Science (0.68)

Add feedback

Is Q-Learning Provably Efficient?

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, Michael I. Jordan

Neural Information Processing SystemsFeb-14-2026, 17:46:20 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, model-free algorithm, q-learning, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Provably Efficient Q-Learning with Low Switching Cost

Yu Bai, Tengyang Xie, Nan Jiang, Yu-Xiang Wang

Neural Information Processing SystemsFeb-12-2026, 01:48:17 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, local switching cost, switching cost, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada (0.04)
Asia > Middle East > Jordan (0.04)

Industry: Health & Medicine (0.95)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

Neural Information Processing SystemsDec-25-2025, 00:48:21 GMT

This paper establishes that optimistic algorithms attain gap-dependent and non-asymptotic logarithmic regret for episodic MDPs. In contrast to prior work, our bounds do not suffer a dependence on diameter-like quantities or ergodicity, and smoothly interpolate between the gap dependent logarithmic-regret, and the $\widetilde{\mathcal{O}}(\sqrt{HSAT})$-minimax rate. The key technique in our analysis is a novel ``clipped'' regret decomposition which applies to a broad family of recent optimistic algorithms for episodic MDPs.

name change, non-asymptotic gap-dependent regret bound, tabular mdp, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.88)

Add feedback

Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

Neural Information Processing SystemsNov-21-2025, 14:07:58 GMT

The environment and an agent's interactions are typically modeled as a Markov

algorithm, mdp, state-action pair, (15 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.83)

Add feedback

Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

Christoph Dann, Tor Lattimore, Emma Brunskill

Neural Information Processing SystemsNov-21-2025, 05:06:49 GMT

Tor Lattimore is now at DeepMind, London 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. Uniform-P AC algorithms suffer none of these drawbacks.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Long Beach (0.24)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Industry: Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Is Q-Learning Provably Efficient?

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, Michael I. Jordan

Neural Information Processing SystemsNov-20-2025, 20:18:52 GMT

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [7, 22]. The theoretical question of "whether model-free algorithms can be made sample efficient " is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

Ito, Shinji, Jamieson, Kevin, Luo, Haipeng, Maiti, Arnab, Tsuchiya, Taira

arXiv.org Machine LearningOct-28-2025

We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and ${O}(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.

algorithm, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

2510.17103

Genre: Research Report > New Finding (0.34)

Industry: Education > Educational Setting (0.34)

Technology: