How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis
