Avoiding $\mathbf{\exp(R_{\max})}$ scaling in RLHF through Preference-based Exploration
