TROLL: Trust Regions improve Reinforcement Learning for Large Language Models
