Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
