COPF: Continual Learning Human Preference through Optimal Policy Fitting
