Sentence-Level Reward Models Can Generalize Better for Aligning LLMs with Human Preferences