RL-finetuning LLMs from on- and off-policy data with a single algorithm