Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models
Alessio Russo, Alberto Maria Metelli, Marcello Restelli
We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications …

Reinforcement Learning (RL) (Sutton and Barto, 2018) tackles the sequential decision-making problem of an agent interacting with an unknown or partially known environment with the goal of maximizing the long-term sum of rewards. The RL agent should trade off between exploring the environment to learn its structure and exploiting its estimates to compute a policy that maximizes the reward. This problem has been successfully addressed in past works under the MDP formulation (Bartlett and Tewari, 2009; Jaksch et al., 2010; Zanette and Brunskill, 2019). MDPs assume full observability of the state space, but this assumption is often violated in many real-world scenarios, such as robotics or finance, where only a partial observation of the environment is available. In this case, it is more appropriate to model the problem using Partially Observable MDPs (Sondik, 1978).
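To make the setting concrete, here is a minimal sketch (not the paper's algorithm) of how a deterministic belief-based policy can operate when the observation model is known exactly while the transition model is only estimated. The array shapes, the names `P_hat` and `O`, and the belief-value function passed to the policy are illustrative assumptions, not constructs from the paper.

```python
import numpy as np

# Illustrative sketch: belief filtering in a tabular POMDP where the observation
# model O is known exactly, while the transition model is only an estimate P_hat.
# Hypothetical shapes: P_hat[s, a, s'] transition probabilities, O[s, o] observation
# probabilities, belief[s] current distribution over hidden states.

def belief_update(belief, action, observation, P_hat, O):
    """One Bayesian filtering step: b'(s') ∝ O(o | s') * sum_s P_hat(s' | s, a) b(s)."""
    predicted = belief @ P_hat[:, action, :]       # predict the next-state distribution
    unnormalized = predicted * O[:, observation]   # correct with the known observation model
    return unnormalized / unnormalized.sum()       # renormalize to a valid belief

def greedy_belief_policy(belief, q_of_belief):
    """A deterministic belief-based policy: act greedily w.r.t. some belief-value
    estimate q_of_belief(belief) -> array of per-action values (assumed given)."""
    return int(np.argmax(q_of_belief(belief)))

# Usage with random placeholder models (purely illustrative).
rng = np.random.default_rng(0)
S, A, Obs = 4, 2, 3
P_hat = rng.dirichlet(np.ones(S), size=(S, A))     # estimated transitions, rows sum to 1
O = rng.dirichlet(np.ones(Obs), size=S)            # known observation model
belief = np.full(S, 1.0 / S)                       # start from the uniform belief
belief = belief_update(belief, action=0, observation=1, P_hat=P_hat, O=O)
action = greedy_belief_policy(belief, lambda b: b @ rng.normal(size=(S, A)))
```

The key point the sketch illustrates is that the policy is a deterministic function of the belief, in contrast to the stochastic policies with a minimum action probability used by prior frequentist approaches.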
Jan-30-2025