Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits

Neural Information Processing Systems 

Reinforcement learning with outcome-based feedback faces a fundamental challenge: when rewards are observed only at trajectory endpoints, how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation.