Policy Optimization Achieves Data-Dependent Regret Bounds in MDPs with Unknown Transitions
