COPR: Continual Human Preference Learning via Optimal Policy Regularization
