Multi-Reward Best Policy Identification

Filippo Vannella
Ericsson AB

Neural Information Processing Systems 

Rewards are a critical aspect of formulating Reinforcement Learning (RL) problems; often, one may be interested in testing multiple reward functions, or the problem may naturally involve multiple rewards. In this study, we investigate the Multi-Reward Best Policy Identification (MR-BPI) problem, where the goal is to determine the best policy for every reward in a given set R with minimal sample complexity and a prescribed confidence level. We derive a fundamental instance-specific lower bound on the sample complexity required by any Probably Correct (PC) algorithm in this setting. This bound guides the design of an optimal exploration policy attaining minimal sample complexity. However, the lower bound involves solving a hard non-convex optimization problem. We address this challenge by devising a convex approximation, enabling the design of sample-efficient algorithms. We propose MR-NaS, a PC algorithm with competitive performance on hard-exploration tabular environments. Extending this approach to Deep RL (DRL), we also introduce DBMR-BPI, an efficient algorithm for model-free exploration in multi-reward settings.
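To make the objective concrete, below is a minimal, hypothetical sketch of what "the best policy for every reward in R" means in a known tabular MDP: for each reward function, value iteration yields the corresponding optimal policy. This is only an illustration of the identification target; the paper's algorithms (MR-NaS, DBMR-BPI) must recover these policies from samples, without access to the transition kernel P. The random MDP, reward set, and discount factor here are assumptions for illustration, not from the paper.

```python
import numpy as np

def best_policy(P, r, gamma=0.9, tol=1e-8):
    """Value iteration for one reward; P has shape (S, A, S), r has shape (S, A)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r + gamma * (P @ V)          # Bellman optimality backup, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)      # greedy (optimal) policy, shape (S,)
        V = V_new

# Hypothetical instance: a random MDP with 5 states, 2 actions, and 3 rewards.
rng = np.random.default_rng(0)
S, A = 5, 2
P = rng.dirichlet(np.ones(S), size=(S, A))              # transition kernel P(s'|s,a)
rewards = [rng.uniform(size=(S, A)) for _ in range(3)]  # the reward set R

# MR-BPI asks for the optimal policy of *every* reward in R; here the model is
# known, whereas the paper's algorithms identify these policies from samples only.
policies = {i: best_policy(P, r) for i, r in enumerate(rewards)}
```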