On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

Neural Information Processing Systems 

Two of them can be characterized as "Reward Maximization" (RM): Standard Policy Gradients (PG) and KL-control.