Interpretable Reward Modeling with Active Concept Bottlenecks

Laguna, Sonia, Kobalczyk, Katarzyna, Vogt, Julia E., Van der Schaar, Mihaela

Jul-22-2025–arXiv.org Artificial Intelligence

We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

Jul-22-2025

arXiv.org PDF

Add feedback

Country:
- Europe
  - Switzerland (0.28)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.28)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.98)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found