Optimistic Policy Optimization with Bandit Feedback