Learning a Pessimistic Reward Model in RLHF