Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model
