Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
