Policy Optimization as Online Learning with Mediator Feedback