Unlocking Transparent Alignment Through Enhanced Inverse Constitutional AI for Principle Extraction

Carl-Leander Henneking, Claas Beger

arXiv.org Artificial Intelligence 

Multiple options exist for aligning pre-trained Large Language Models (LLMs) with human preferences. Popular methods include Reinforcement Learning from Human Feedback (RLHF), which trains a reward model as a proxy for human feedback to rate model outputs, and Direct Preference Optimization (DPO), which eliminates the explicit reward model and instead encodes human preferences implicitly in its fine-tuning loss. Both approaches rely heavily on pairwise human-annotated preference data that ranks model outputs. As an alternative, Anthropic introduced Constitutional AI (CAI) [1], a rule-based approach to alignment built on a core set of principles and values called a constitution. This set contains key ethical, moral, and safety standards that promote desired behaviors through repeated critiquing of model outputs. Having an explicitly defined set of core values aids the interpretability of the changes induced by the alignment procedure, whereas typical approaches such as DPO or RLHF rely on a set of principles embedded only implicitly in the pairwise preference data. Building on the idea of CAI, [2] proposed an Inverse Constitutional AI (ICAI) algorithm.
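For context, the way DPO folds the preference model into its loss can be made concrete. The following is the standard objective from the DPO literature (not specific to this work); it compares the policy $\pi_\theta$ against a frozen reference model $\pi_{\mathrm{ref}}$ on preferred/rejected completion pairs $(y_w, y_l)$:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right],
$$

where $\sigma$ is the logistic function and $\beta$ controls how far the policy may drift from the reference model. Note that no principles are written down anywhere in this objective; the values being optimized for exist only implicitly in which completions annotators labeled as $y_w$, which is exactly the interpretability gap that an explicit constitution addresses.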