Interpretable Reward Modeling with Active Concept Bottlenecks
