CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization
