near-optimal regret
Regret-Optimal Q-Learning with Low Cost for Single-Agent and Federated Reinforcement Learning
Motivated by real-world settings where data collection and policy deployment--whether for a single agent or across multiple agents--are costly, we study the problem of on-policy single-agent reinforcement learning (RL) and federated RL (FRL) with a focus on minimizing burn-in costs (the sample sizes needed to reach near-optimal regret) and policy switching or communication costs. In parallel finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states and $A$ actions, existing methods either require superlinear burn-in costs in $S$ and $A$ or fail to achieve logarithmic switching or communication costs.
Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
A Biased Graph Neural Network Sampler with Near-Optimal Regret
Graph neural networks (GNN) have recently emerged as a vehicle for applying deep network architectures to graph and relational data. However, given the increasing size of industrial datasets, in many practical situations, the message passing computations required for sharing information across GNN layers are no longer scalable. Although various sampling methods have been introduced to approximate full-graph training within a tractable budget, there remain unresolved complications such as high variances and limited theoretical guarantees. To address these issues, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem but with a newly-designed reward function that introduces some degree of bias designed to reduce variance and avoid unstable, possibly-unbounded pay outs. And unlike prior bandit-GNN use cases, the resulting policy leads to near-optimal regret while accounting for the GNN training dynamics introduced by SGD.
Review for NeurIPS paper: Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes
Additional Feedback: Response to author feedback: From the informal discussion about the cross-component counters, I'm getting that it's somehow bad if different components have been explored unevenly and therefore encouraging more balanced exploration (pairwise) reduces overall variance in the amount of exploration between components. I'm sure there's a lot I'm not getting, but that helps a bit. I think it should be the case that you recover an object when you multiply its factors together (for the appropriate definition of "multiply"). There are papers (well, just one I can think of) that deal with truly factored MDPs that are the product of simpler MDPs. They correctly call their MDPs factored.
Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode k is revealed only in the end of episode k d k, where the delay d k can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal \sqrt{K D} regret, where K is the number of episodes and D \sum_{k 1} K d k is the total delay, significantly improving upon the best known regret bound of (K D) {2/3} .
A Biased Graph Neural Network Sampler with Near-Optimal Regret
Graph neural networks (GNN) have recently emerged as a vehicle for applying deep network architectures to graph and relational data. However, given the increasing size of industrial datasets, in many practical situations, the message passing computations required for sharing information across GNN layers are no longer scalable. Although various sampling methods have been introduced to approximate full-graph training within a tractable budget, there remain unresolved complications such as high variances and limited theoretical guarantees. To address these issues, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem but with a newly-designed reward function that introduces some degree of bias designed to reduce variance and avoid unstable, possibly-unbounded pay outs. And unlike prior bandit-GNN use cases, the resulting policy leads to near-optimal regret while accounting for the GNN training dynamics introduced by SGD.
Reviews: Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
This is an excellent theoretical contribution. The analysis is quite heavy and has many subtleties. I do not have enough time to read the appended proofs; also, the subject of the paper is not in my area of research. The comments below are based on the impression I got after reading carefully the first 8 pages of the paper and glancing through the rest in the supplementary file. Summary: This paper is about reinforcement learning in weakly-communicating MDP under the average-reward criterion.
Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling
Provodin, Danil, Kaptein, Maurits, Pechenizkiy, Mykola
We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDP) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms. Our main theoretical result is a Bayesian regret bound for each cost component of $\tilde{O} (DS\sqrt{AT})$ for any communicating CMDP with $S$ states, $A$ actions, and diameter $D$. This regret bound matches the lower bound in order of time horizon $T$ and is the best-known regret bound for communicating CMDPs achieved by a computationally tractable algorithm. Empirical results show that our posterior sampling algorithm outperforms the existing algorithms for constrained reinforcement learning.