Post-hoc Reward Calibration: A Case Study on Length Bias
