AITopics | near-optimal regret

Regret-Optimal Q-Learning with Low Cost for Single-Agent and Federated Reinforcement Learning

Neural Information Processing Systems | Jun-14-2026, 01:28:18 GMT

Motivated by real-world settings where data collection and policy deployment--whether for a single agent or across multiple agents--are costly, we study the problem of on-policy single-agent reinforcement learning (RL) and federated RL (FRL) with a focus on minimizing burn-in costs (the sample sizes needed to reach near-optimal regret) and policy switching or communication costs. In parallel finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states and $A$ actions, existing methods either require superlinear burn-in costs in $S$ and $A$ or fail to achieve logarithmic switching or communication costs.

artificial intelligence, machine learning, reinforcement learning, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.64)

9bcd1fa0c05e5f25ba7a1261f1852e82-Paper-Conference.pdf

Neural Information Processing Systems | Feb-11-2026, 00:18:05 GMT

algorithm, log 2, reinforcement learning, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Neural Information Processing Systems | Dec-25-2025, 10:50:31 GMT

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d^k$, where the delay $d^k$ can change over episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
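To give a feel for the gap between the two rates quoted in this abstract, here is a minimal sketch that computes the total delay $D$ for a made-up delay sequence and evaluates $\sqrt{K + D}$ against $(K + D)^{2/3}$. The function name `delay_regret_rates` and the delay sequence are illustrative assumptions, not taken from the paper, and constants, logarithmic factors, and the dependence on other problem parameters are all ignored.

```python
import math


def delay_regret_rates(delays):
    """Return (K, D, sqrt(K + D), (K + D)**(2/3)) for per-episode delays.

    Hypothetical helper: it only compares growth rates and ignores
    constants, log factors, and any dependence on states, actions, or horizon.
    """
    K = len(delays)          # number of episodes
    D = sum(delays)          # total delay, D = sum_k d^k
    total = K + D
    return K, D, math.sqrt(total), total ** (2.0 / 3.0)


# Assumed example: 10,000 episodes, each with feedback delayed by 9 episodes.
K, D, new_rate, old_rate = delay_regret_rates([9] * 10_000)
print(f"K={K}, D={D}")
print(f"sqrt(K + D)    ~ {new_rate:.1f}")
print(f"(K + D)^(2/3)  ~ {old_rate:.1f}")
```

Under these assumed numbers the square-root rate is several times smaller than the two-thirds-power rate, and the ratio widens as $K + D$ grows.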

adversarial mdp, name change, near-optimal regret, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

9bcd1fa0c05e5f25ba7a1261f1852e82-Paper-Conference.pdf

Neural Information Processing Systems | Aug-17-2025, 05:51:15 GMT

data mining, machine learning, reinforcement learning, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Data Science > Data Mining (0.69)

A Biased Graph Neural Network Sampler with Near-Optimal Regret

Neural Information Processing Systems | May-26-2025, 19:07:58 GMT

Graph neural networks (GNNs) have recently emerged as a vehicle for applying deep network architectures to graph and relational data. However, given the increasing size of industrial datasets, in many practical situations the message-passing computations required for sharing information across GNN layers are no longer scalable. Although various sampling methods have been introduced to approximate full-graph training within a tractable budget, there remain unresolved complications such as high variance and limited theoretical guarantees. To address these issues, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem, but with a newly designed reward function that introduces some degree of bias in order to reduce variance and avoid unstable, possibly unbounded payouts. Unlike prior bandit-GNN use cases, the resulting policy leads to near-optimal regret while accounting for the GNN training dynamics introduced by SGD.
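The idea of treating neighbor sampling as a multi-armed bandit with a bounded, slightly biased reward can be sketched as follows. This is not the paper's algorithm or reward function: it is a generic EXP3-style sampler over one node's neighbors, with the raw payoff clipped to $[0, 1]$ as a stand-in for a bias that keeps payouts bounded. The class name `Exp3NeighborSampler`, the neighbor names, and the payoff values are all hypothetical.

```python
import math
import random


class Exp3NeighborSampler:
    """EXP3-style bandit over a node's neighbors (illustrative sketch only)."""

    def __init__(self, neighbors, gamma=0.1, seed=0):
        self.neighbors = list(neighbors)
        self.gamma = gamma                      # exploration rate
        self.weights = [1.0] * len(self.neighbors)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        n = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / n
                for w in self.weights]

    def sample(self):
        # Draw one neighbor index and return it with its sampling probability.
        probs = self.probabilities()
        idx = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return idx, probs[idx]

    def update(self, idx, prob, raw_reward):
        # Assumption: clipping to [0, 1] biases the reward but bounds it.
        reward = min(max(raw_reward, 0.0), 1.0)
        estimate = reward / prob                # importance-weighted estimate
        n = len(self.weights)
        self.weights[idx] *= math.exp(self.gamma * estimate / n)


# Assumed toy setup: neighbor "u2" has a large raw payoff, the others a small one.
neighbors = ["u1", "u2", "u3"]
payoff = {"u1": 0.1, "u2": 5.0, "u3": 0.1}
sampler = Exp3NeighborSampler(neighbors, gamma=0.1, seed=0)
for _ in range(500):
    idx, prob = sampler.sample()
    sampler.update(idx, prob, payoff[neighbors[idx]])
probs = sampler.probabilities()
print(dict(zip(neighbors, (round(p, 3) for p in probs))))
```

In this toy run the sampler concentrates on the neighbor with the largest clipped payoff while the exploration term keeps every neighbor's probability above `gamma / n`. A real GNN sampler would derive the reward from training signals rather than a fixed table.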

artificial intelligence, data mining, machine learning, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.65)

Review for NeurIPS paper: Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

Neural Information Processing Systems | Feb-7-2025, 13:38:20 GMT

Additional Feedback: Response to author feedback: From the informal discussion about the cross-component counters, I'm getting that it's somehow bad if different components have been explored unevenly and therefore encouraging more balanced exploration (pairwise) reduces overall variance in the amount of exploration between components. I'm sure there's a lot I'm not getting, but that helps a bit. I think it should be the case that you recover an object when you multiply its factors together (for the appropriate definition of "multiply"). There are papers (well, just one I can think of) that deal with truly factored MDPs that are the product of simpler MDPs. They correctly call their MDPs factored.

factored markov decision process, minimax optimal reinforcement learning, probability, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.43)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Neural Information Processing Systems | Jan-19-2025, 01:04:21 GMT

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d^k$, where the delay $d^k$ can change over episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.

adversarial mdp, delayed bandit feedback, near-optimal regret, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

A Biased Graph Neural Network Sampler with Near-Optimal Regret

Neural Information Processing Systems | Oct-10-2024, 06:22:29 GMT

Graph neural networks (GNNs) have recently emerged as a vehicle for applying deep network architectures to graph and relational data. However, given the increasing size of industrial datasets, in many practical situations the message-passing computations required for sharing information across GNN layers are no longer scalable. Although various sampling methods have been introduced to approximate full-graph training within a tractable budget, there remain unresolved complications such as high variance and limited theoretical guarantees. To address these issues, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem, but with a newly designed reward function that introduces some degree of bias in order to reduce variance and avoid unstable, possibly unbounded payouts. Unlike prior bandit-GNN use cases, the resulting policy leads to near-optimal regret while accounting for the GNN training dynamics introduced by SGD.

biased graph neural network sampler, near-optimal regret

Neural Information Processing Systems

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.65)

Reviews: Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

Neural Information Processing Systems | Oct-7-2024, 08:33:21 GMT

This is an excellent theoretical contribution. The analysis is quite heavy and has many subtleties. I do not have enough time to read the appended proofs; also, the subject of the paper is not in my area of research. The comments below are based on the impression I got after reading carefully the first 8 pages of the paper and glancing through the rest in the supplementary file. Summary: This paper is about reinforcement learning in weakly-communicating MDP under the average-reward criterion.

algorithm, artificial intelligence, machine learning, (13 more...)

Neural Information Processing Systems

Industry: Energy > Oil & Gas > Upstream (0.52)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.40)

Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling

Provodin, Danil, Kaptein, Maurits, Pechenizkiy, Mykola

arXiv.org Artificial Intelligence | May-29-2024

We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDP) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms. Our main theoretical result is a Bayesian regret bound for each cost component of $\tilde{O} (DS\sqrt{AT})$ for any communicating CMDP with $S$ states, $A$ actions, and diameter $D$. This regret bound matches the lower bound in order of time horizon $T$ and is the best-known regret bound for communicating CMDPs achieved by a computationally tractable algorithm. Empirical results show that our posterior sampling algorithm outperforms the existing algorithms for constrained reinforcement learning.
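As a small worked example of what the $\tilde{O}(DS\sqrt{AT})$ rate implies, the sketch below evaluates $D S \sqrt{A T}$ for two time horizons. The function name `regret_bound_scale` and the parameter values are illustrative assumptions; constants and logarithmic factors hidden by the $\tilde{O}$ notation are dropped, so only ratios between the outputs are meaningful.

```python
import math


def regret_bound_scale(diameter, n_states, n_actions, horizon):
    """Evaluate D * S * sqrt(A * T), the leading term of the stated bound.

    Hypothetical helper: constants and log factors are omitted.
    """
    return diameter * n_states * math.sqrt(n_actions * horizon)


# Assumed problem sizes: D = 4, S = 10, A = 5.
base = regret_bound_scale(diameter=4, n_states=10, n_actions=5, horizon=20_000)
longer = regret_bound_scale(diameter=4, n_states=10, n_actions=5, horizon=80_000)

# Quadrupling T scales the bound by sqrt(4) = 2 ...
print(f"bound ratio for 4x horizon: {longer / base:.3f}")
# ... so the bound divided by T (average regret per step) shrinks.
print(f"per-step scale: {base / 20_000:.4f} -> {longer / 80_000:.4f}")
```

Because the bound grows as $\sqrt{T}$, the per-step regret scale falls as $1/\sqrt{T}$, which is the sense in which the rate is sublinear in the time horizon.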

algorithm, cmdp, efficient exploration, (12 more...)

arXiv.org Artificial Intelligence

2405.19017

Country:

Europe > Austria > Vienna (0.14)
Europe > Netherlands > North Brabant > Eindhoven (0.04)
Europe > Finland > Central Finland > Jyväskylä (0.04)
Europe > Netherlands > North Brabant > 's-Hertogenbosch (0.04)

Genre: Research Report > New Finding (0.48)
