Dataset Reset Policy Optimization for RLHF