Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion