Generalization in Monitored Markov Decision Processes (Mon-MDPs)
Mohammedalamen, Montaser, Bowling, Michael
arXiv.org Artificial Intelligence
Reinforcement learning (RL) typically models the interaction between the agent and environment as a Markov decision process (MDP), where the rewards that guide the agent's behavior are always observable. However, in many real-world scenarios, rewards are not always observable, which can be modeled as a monitored Markov decision process (Mon-MDP). Prior work on Mon-MDPs has been limited to simple, tabular cases, restricting its applicability to real-world problems. This work explores Mon-MDPs using function approximation (FA) and investigates the challenges involved. We show that combining function approximation with a learned reward model enables agents to generalize from monitored states with observable rewards to unmonitored environment states with unobservable rewards. We thereby demonstrate that such generalization with a reward model achieves near-optimal policies in environments formally defined as unsolvable. However, we identify a critical limitation of such function approximation, where agents incorrectly extrapolate rewards due to overgeneralization, resulting in undesirable behaviors. To mitigate overgeneralization, we propose a cautious policy optimization method leveraging reward uncertainty. This work serves as a step towards bridging the gap between Mon-MDP theory and real-world applications.
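As a rough illustration of the idea in the abstract, and not the authors' actual method, the sketch below fits a reward model on monitored states only and then penalizes its predictions by an uncertainty estimate when queried elsewhere. Everything here is an assumption made for illustration: the one-dimensional state, the linear model, the bootstrap ensemble used to estimate uncertainty, and the names `fit_line`, `fit_ensemble`, `uncertainty`, `cautious_reward`, and `caution`.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least squares for reward ~ a * x + b on (x, reward) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_r = sum(r for _, r in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0.0:  # degenerate resample: fall back to a constant model
        return 0.0, mean_r
    cov = sum((x - mean_x) * (r - mean_r) for x, r in points)
    a = cov / var_x
    return a, mean_r - a * mean_x


def fit_ensemble(points, n_models=30, seed=0):
    """Bootstrap ensemble (an assumed stand-in for a reward-uncertainty estimate)."""
    rng = random.Random(seed)
    return [fit_line([rng.choice(points) for _ in points]) for _ in range(n_models)]


def uncertainty(ensemble, x):
    """Disagreement between ensemble members at state x."""
    return statistics.pstdev(a * x + b for a, b in ensemble)


def cautious_reward(ensemble, x, caution=1.0):
    """Mean predicted reward minus a penalty proportional to the disagreement."""
    mean_pred = statistics.mean(a * x + b for a, b in ensemble)
    return mean_pred - caution * uncertainty(ensemble, x)


# Hypothetical monitored states in [0, 1] with noisy observed rewards.
data_rng = random.Random(1)
monitored = [(i / 10, i / 10 + data_rng.gauss(0.0, 0.1)) for i in range(11)]
ensemble = fit_ensemble(monitored)

# Compare a state inside the monitored range with one far outside it.
for x in (0.5, 5.0):
    print(x, uncertainty(ensemble, x), cautious_reward(ensemble, x, caution=2.0))
```

In this toy setting the ensemble members agree closely inside the monitored range and diverge far from it, so the penalty mostly affects extrapolated rewards; a planner using `cautious_reward` in place of the plain prediction would be less inclined to chase rewards the model has merely extrapolated.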
May-15-2025
- Country:
  - North America > Canada
  - Europe > United Kingdom
    - England > Oxfordshire > Oxford (0.04)
  - Asia > Middle East
    - Jordan (0.04)
- Genre:
  - Research Report > New Finding (0.68)
- Industry:
  - Leisure & Entertainment > Games (0.67)
- Technology: