Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback