Value Imprint: A Technique for Auditing the Human Values Embedded in RLHF Datasets
Neural Information Processing Systems
Large language models (LLMs) are increasingly fine-tuned on RLHF datasets to align them with human preferences and values. However, little research has investigated which specific human values these datasets operationalize. In this paper, we introduce Value Imprint, a framework for auditing and classifying the human values embedded within RLHF datasets. To investigate the viability of this framework, we conducted three case studies, auditing the Anthropic/hh-rlhf, OpenAI WebGPT Comparisons, and Alpaca GPT-4-LLM datasets. Our analysis involved a two-phase process.
Mar-27-2025, 08:36:45 GMT
- Country:
  - North America > United States > Indiana > Tippecanoe County > West Lafayette (0.14)
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Education (0.93)
  - Government > Military (0.92)
  - Health & Medicine
    - Consumer Health (0.67)
    - Therapeutic Area (0.93)
  - Law > Civil Rights & Constitutional Law (0.68)
- Technology: