Proximal Policy Optimization and its Dynamic Version for Sequence Generation