KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF
