Goto

Collaborating Authors

 skh


Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

arXiv.org Machine Learning

We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{ฯ€^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/ฮต^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{ฮ”_{\min}^{p}} + pKฮ”_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $ฮ”_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.


Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

arXiv.org Machine Learning

We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $ฮด\in (0,1]$, thereby characterizing the regret distribution across the full range of $ฮด$. We present a simple UCBVI-style algorithm with exploration bonus $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}(\sqrt{AT}\log(1/ฮด))$, confirming the conjecture of Lattimore & Szepesvรกri (2020, Section 17.1) for the first time.



Near-OptimalRegretforAdversarialMDPwith DelayedBanditFeedback

Neural Information Processing Systems

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observedindelay.



ProvablyEfficientCausalReinforcementLearning withConfoundedObservationalData

Neural Information Processing Systems

Empowered by neural networks, deep reinforcement learning (DRL) achieves tremendous empirical success. However, DRL requires a large dataset by interacting with the environment, which is unrealistic in critical scenarios such as autonomous driving and personalized medicine. In this paper, we study how to incorporate the dataset collected in the offline setting to improve the sample efficiency in the online setting. To incorporate the observational data, we face two challenges.


OnReward-FreeReinforcementLearningwith LinearFunctionApproximation

Neural Information Processing Systems

During the exploration phase, an agent collects samples without using a pre-specified reward function. After the exploration phase, a reward function is given, and the agent uses samples collected during the exploration phase to computeanear-optimalpolicy.



Uniform-PACBoundsforReinforcementLearning withLinearFunctionApproximation

Neural Information Processing Systems

Designing efficient reinforcement learning (RL) algorithms for environments with large state and action spaces is one of the main tasks in the RL community.


ProvablyEfficientReinforcementLearningwith LinearFunctionApproximationunderAdaptivity Constraints

Neural Information Processing Systems

Real-world reinforcement learning (RL) applications often come with possibly infinite state and action space, and in such a situation classical RL algorithms developed in the tabular setting are not applicable anymore. A popular approach to overcoming this issue is by applying function approximation techniques to the underlying structures of the Markovdecision processes (MDPs).