Optimistic Policy Optimization with Bandit Feedback
