What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

arXiv.org Artificial Intelligence 

Sparse-reward reinforcement learning (RL) remains fundamentally hard: without structure, any agent needs Ω(|S||A|/p) samples to recover rewards. We introduce Policy-Aware Matrix Completion (PAMC) as a first concrete step toward a structural reward learning framework. Our key idea is to exploit approximate low-rank + sparse structure in the reward matrix under policy-biased (missing-not-at-random, MNAR) sampling. We prove recovery guarantees with inverse-propensity weighting, and establish a visitation-weighted error-to-regret bound linking completion error to control performance. Importantly, when assumptions weaken, PAMC degrades gracefully: confidence intervals widen and the algorithm abstains, ensuring safe fallback to exploration. Empirically, PAMC improves sample efficiency across Atari-26 (10M steps), DM Control, MetaWorld MT50, D4RL offline RL, and preference-based RL benchmarks, outperforming DrQ-v2, DreamerV3, Agent57, T-REX/D-REX, and PrefPPO under compute-normalized comparisons. Our results highlight PAMC as a practical and principled tool when structural rewards exist, and as a concrete first instantiation of a broader structural reward learning perspective. What fundamental properties of reward functions determine the sample complexity of reinforcement learning?
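To make the key idea concrete, here is a minimal NumPy sketch of inverse-propensity-weighted low-rank + sparse matrix completion of the kind the abstract describes. This is not the authors' released implementation: the function name `ipw_low_rank_sparse_completion`, the parameter names (`propensity`, `lam`, `lr`), and the choice of a proximal-gradient solver are all illustrative assumptions.

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def ipw_low_rank_sparse_completion(R_obs, mask, propensity, rank=5,
                                   lam=0.1, lr=0.01, iters=2000, seed=0):
    """Fit R ~ U V^T + S from policy-biased (MNAR) observations.

    R_obs:      (|S|, |A|) matrix of observed rewards, arbitrary where unobserved
    mask:       boolean (|S|, |A|) matrix, True where (s, a) was visited
    propensity: estimated visitation probability of each (s, a); assumed
                bounded away from zero on observed entries
    """
    rng = np.random.default_rng(seed)
    n_s, n_a = R_obs.shape
    U = 0.01 * rng.standard_normal((n_s, rank))    # low-rank factors
    V = 0.01 * rng.standard_normal((n_a, rank))
    S = np.zeros_like(R_obs, dtype=float)          # sparse component

    # Inverse-propensity weights debias the policy-induced sampling;
    # normalizing them keeps the fixed step size lr on a sane scale.
    w = np.where(mask, 1.0 / np.clip(propensity, 1e-3, None), 0.0)
    w /= w[mask].mean()

    for _ in range(iters):
        resid = w * (U @ V.T + S - R_obs)          # weighted residual on observed entries
        U, V = U - lr * (resid @ V), V - lr * (resid.T @ U)  # gradient step on factors
        S = soft_threshold(S - lr * resid, lr * lam)         # prox step keeps S sparse
    return U @ V.T + S
```

Soft-thresholding on S is what separates occasional outlier rewards from the low-rank bulk, and the clip on the propensities plays the role of the bounded-propensity assumption under which IPW estimators stay well behaved.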
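The error-to-regret bound is only stated informally in the abstract; the calculation below is not the paper's theorem, just the standard simulation-lemma-style argument that shows why a visitation-weighted completion error is the right quantity to control. Writing $d^\pi$ for the normalized discounted state-action visitation distribution of a policy $\pi$:

```latex
% Discounted value of a fixed policy \pi under reward R:
%   V^\pi_R = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^\pi}[R(s,a)]
% Evaluating the same policy under the completed reward \hat R gives
\bigl| V^\pi_{\hat R} - V^\pi_R \bigr|
  = \frac{1}{1-\gamma}\,
    \Bigl| \mathbb{E}_{(s,a)\sim d^\pi}\bigl[\hat R(s,a) - R(s,a)\bigr] \Bigr|
  \le \frac{1}{1-\gamma}\,
    \mathbb{E}_{(s,a)\sim d^\pi}\Bigl[\bigl|\hat R(s,a) - R(s,a)\bigr|\Bigr].
```

Errors on state-action pairs the policy rarely visits are downweighted by $d^\pi$, which is why policy-biased sampling and a policy-weighted error guarantee fit together naturally.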
