Generalization in Monitored Markov Decision Processes (Mon-MDPs)

Mohammedalamen, Montaser, Bowling, Michael

May-15-2025–arXiv.org Artificial Intelligence

Reinforcement learning (RL) typically models the interaction between the agent and environment as a Markov decision process (MDP), where the rewards that guide the agent's behavior are always observable. However, in many real-world scenarios, rewards are not always observable, which can be modeled as a monitored Markov decision process (Mon-MDP). Prior work on Mon-MDPs have been limited to simple, tabular cases, restricting their applicability to real-world problems. This work explores Mon-MDPs using function approximation (FA) and investigates the challenges involved. We show that combining function approximation with a learned reward model enables agents to generalize from monitored states with observable rewards, to unmonitored environment states with unobservable rewards. Therefore, we demonstrate that such generalization with a reward model achieves near-optimal policies in environments formally defined as unsolvable. However, we identify a critical limitation of such function approximation, where agents incorrectly extrapolate rewards due to overgeneralization, resulting in undesirable behaviors. To mitigate overgeneralization, we propose a cautious police optimization method leveraging reward uncertainty. This work serves as a step towards bridging this gap between Mon-MDP theory and real-world applications.

machine learning, reinforcement learning, reward model, (15 more...)

arXiv.org Artificial Intelligence

May-15-2025

arXiv.org PDF

Add feedback

Country:
- North America > Canada
  - Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
- Europe > United Kingdom
  - England > Oxfordshire > Oxford (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report > New Finding (0.68)

Industry:
- Leisure & Entertainment > Games (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning
    - Uncertainty (0.89)
    - Optimization (0.87)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks (1.00)
    - Learning Graphical Models > Undirected Networks
      - Markov Models (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found